Abstract



We conclude a sequence of work by giving near-optimal sketching and streaming algorithms
for estimating Shannon entropy in the most general streaming model, with arbitrary insertions
and deletions. This improves on prior results that obtain suboptimal space bounds
in the general model, and near-optimal bounds in the insertion-only model without sketching.
Our high-level approach is simple: we give algorithms to estimate Renyi and Tsallis
entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of
our estimates is proven using approximation theory arguments and extremal properties of
Chebyshev polynomials, a technique which may be useful for other problems. Our work also
yields the best-known and near-optimal additive approximations for entropy, and hence also
for conditional entropy and mutual information.

1 Introduction 

Streaming algorithms have attracted much attention in several computer science communities,
notably theory, databases, and networking. Many algorithmic problems in this model are now
well-understood, for example, the problem of estimating frequency moments [1, 2, 10, 18, 32, 35].
More recently, several researchers have studied the problem of estimating the empirical entropy
of a stream [3, 6, 7, 12, 13, 37].

Motivation. There are two key motivations for studying entropy. The first is that it is a 
fundamentally important quantity with useful algebraic properties (chain rule, etc.). The second
stems from several practical applications in computer networking, such as network anomaly
detection. Let us consider a concrete example. One form of malicious activity on the internet 
is port scanning, in which attackers probe target machines, trying to find open ports which 
could be leveraged for further attacks. In contrast, typical internet traffic is directed to a small 
number of heavily used ports for web traffic, email delivery, etc. Consequently, when a port 
scanning attack is underway, there is a significant change in the distribution of port numbers in 
the packets being delivered. It has been shown that measuring the entropy of the distribution 
of port numbers provides an effective means to detect such attacks. See Lakhina et al. [19] and 
Xu et al. [36] for further information about such problems and methods for their solution. 
Our Techniques. In this paper, we give an algorithm for estimating empirical Shannon entropy
while using a nearly optimal amount of space. Our algorithm is actually a sketching algorithm, 
not just a streaming algorithm, and it applies to general streams which allow insertions and 
deletions of elements. One attractive aspect of our work is its clean high-level approach: we 
reduce the entropy estimation problem to the well-studied frequency moment problem. More 
concretely, we give algorithms for estimating other notions of entropy, Renyi and Tsallis entropy, 
which are closely related to frequency moments. The link to Shannon entropy is established by 
proving bounds on the rate at which these other entropies converge toward Shannon entropy. 
Remarkably, it seems that such an analysis was not previously known. 

There are several technical obstacles that arise with this approach. Unfortunately, it does 
not seem that the optimal amount of space can be obtained while using just a single estimate of 
Renyi or Tsallis entropy. We overcome this obstacle by using several estimates, together with 
approximation theory arguments and certain infrequently-used extremal properties of Chebyshev 
polynomials. To our knowledge, this is the first use of such techniques in the context of streaming 
algorithms, and it seems likely that these techniques could be applicable to many other problems. 

Such arguments yield good algorithms for additively estimating entropy, but obtaining a 
good multiplicative approximation is more difficult when the entropy is very small. In such a 
scenario, there is necessarily a very heavy element, and the task that one must solve is to estimate 
the moment of all elements excluding this heavy element. This task has become known as the 
residual moment estimation problem, and it is emerging as a useful building block for other 
streaming problems [3, 5, 10]. To estimate the $\alpha$-th residual moment for $\alpha \in (0, 2]$, we show that
$\tilde{O}(\varepsilon^{-2} \log m)$ bits of space suffice with a random oracle and $\tilde{O}(\varepsilon^{-2} \log^2 m)$ bits without. This
compares with existing algorithms that use $O(\varepsilon^{-2} \log^2 m)$ bits for $\alpha = 2$ [11], and $O(\varepsilon^{-2} \log m)$
for $\alpha = 1$ [10]. No non-trivial algorithms were previously known for $\alpha \notin \{1, 2\}$. However, the
previously known algorithms were more general in ways unrelated to the needs of our work: they
can remove the $k$ heaviest elements without requiring that they are sufficiently heavy.

Multiplicative Entropy Estimation. Let us now state the performance of these algorithms 
more explicitly. We focus exclusively on single-pass algorithms unless otherwise noted. The first 
algorithms for approximating entropy in the streaming model are due to Guha et al. [13]; they
achieved $O(\varepsilon^{-2} + \log m)$ words of space but assumed a randomly ordered stream. Chakrabarti,
Do Ba and Muthukrishnan [7] then gave an algorithm for worst-case ordered streams using
$O(\varepsilon^{-2} \log^2 m)$ words of space, but required two passes over the input. The algorithm of
Chakrabarti, Cormode and McGregor [6] uses $O(\varepsilon^{-2} \log m)$ words of space to give a multiplicative
$1 + \varepsilon$ approximation, although their algorithm cannot produce sketches and only applies to
insertion-only streams. In contrast, the algorithm of Bhuvanagiri and Ganguly [3] provides a
sketch and can handle deletions but requires roughly $\tilde{O}(\varepsilon^{-3} \log^4 m)$ words.¹

Our work focuses primarily on the strict turnstile model (defined in Section 2), which allows
deletions. Our algorithm for multiplicatively estimating Shannon entropy uses $\tilde{O}(\varepsilon^{-2} \log m)$
words of space. These bounds are nearly optimal in terms of the dependence on $\varepsilon$, since there
is an $\tilde{\Omega}(\varepsilon^{-2})$ lower bound even for insertion-only streams. Our algorithms assume access to
a random oracle. This assumption can be removed through the use of Nisan's pseudorandom
generator [23], increasing the space bounds by a factor of $O(\log m)$.

Additive Entropy Estimation. Additive approximations of entropy are also useful, as they
directly yield additive approximations of conditional entropy and mutual information, which
cannot be approximated multiplicatively in small space [17]. Chakrabarti et al. noted that since
Shannon entropy is bounded above by $\log m$, a multiplicative $(1 + (\varepsilon/\log m))$ approximation
yields an additive $\varepsilon$-approximation. In this way, the work of Chakrabarti et al. [6] and
Bhuvanagiri and Ganguly [3] yield additive $\varepsilon$ approximations using $O(\varepsilon^{-2} \log^3 m)$ and $\tilde{O}(\varepsilon^{-3} \log^7 m)$
words of space respectively. Our algorithm yields an additive $\varepsilon$ approximation using only
$\tilde{O}(\varepsilon^{-2} \log m)$ words of space. In particular, our space bounds for multiplicative and additive
approximation differ by only $\log \log m$ factors. Zhao et al. [37] give practical methods for
additively estimating the so-called entropy norm of a stream. Their algorithm can be viewed as a
special case of ours since it interpolates Shannon entropy using two estimates of Tsallis entropy,
although this interpretation was seemingly unknown to those authors.

¹ A recent, yet unpublished improvement by the same authors [4] improves this to $\tilde{O}(\varepsilon^{-3} \log^3 m)$ words.

Other Information Statistics. We also give algorithms for approximating Renyi [26] and 
Tsallis [33] entropy. Renyi entropy plays an important role in expanders [15], pseudorandom 
generators, quantum computation [34, 38], and ecology [22, 27]. Tsallis entropy is an important
quantity in physics that generalizes Boltzmann-Gibbs entropy, and also plays a role in the
quantum context. Renyi and Tsallis entropy are both parameterized by a scalar $\alpha > 0$. The
efficiency of our estimation algorithms depends on $\alpha$, and is stated precisely in Section 5.

A preliminary version of this work appeared in the IEEE Information Theory Workshop [14]. 

2 Preliminaries 

Let $A = (A_1, \ldots, A_n) \in \mathbb{Z}^n$ be a vector initialized as $\vec{0}$ which is modified by a stream of $m$
updates. Each update is of the form $(i, v)$, where $i \in [n]$ and $v \in \{-M, \ldots, M\}$, and causes
the change $A_i \leftarrow A_i + v$. For simplicity in stating bounds, we henceforth assume $m \ge n$ and
$M = 1$; the latter can be simulated by increasing $m$ by a factor of $M$ and representing an update
$(i, v)$ with $|v|$ separate updates (though in actuality our algorithm can perform all $|v|$ updates
simultaneously in the time it takes to do one update). The vector $A$ gives rise to a probability
distribution $x = (x_1, \ldots, x_n)$ with $x_i = |A_i| / \|A\|_1$. Thus for each $i$ either $x_i = 0$ or $x_i \ge 1/m$.
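The update model can be illustrated with a small sketch. The helper names `apply_updates` and `empirical_distribution` are hypothetical and introduced only for this example; the code stores the vector explicitly, which a real streaming algorithm would of course avoid.

```python
from collections import defaultdict

def apply_updates(updates):
    """Apply a stream of (i, v) updates to an initially zero vector A."""
    A = defaultdict(int)
    for i, v in updates:
        A[i] += v  # the change A_i <- A_i + v
    return dict(A)

def empirical_distribution(A):
    """Return x with x_i = |A_i| / ||A||_1, omitting zero coordinates."""
    l1 = sum(abs(a) for a in A.values())
    return {i: abs(a) / l1 for i, a in A.items() if a != 0}

# A stream of m = 6 unit updates (M = 1), including one deletion.
stream = [(1, +1), (2, +1), (1, +1), (3, +1), (2, -1), (1, +1)]
A = apply_updates(stream)
x = empirical_distribution(A)
```

Every nonzero probability here is at least $1/m$, matching the observation above.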

In the strict turnstile model, we assume $A_i \ge 0$ for all $i \in [n]$ at the end of the stream. In the
general update model we make no such assumption. For the remainder of this paper, we assume
the strict turnstile model and assume access to a random oracle, unless stated otherwise. Our
algorithms also extend to the general update model, typically increasing bounds by a factor of
$O(\log m)$. As remarked above, the random oracle can be removed, using [23], while increasing
the space by another $O(\log m)$ factor. When giving bounds, we often use the following tilde
notation: we say $f(m, \varepsilon) = \tilde{O}(g(m, \varepsilon))$ if $f(m, \varepsilon) = O(g(m, \varepsilon) (\log \log m + \log(1/\varepsilon))^{O(1)})$.

We now define some functions commonly used in future sections. The $\alpha$-th norm of a vector is
denoted $\|\cdot\|_\alpha$. We define the $\alpha$-th moment as $F_\alpha = \sum_{i=1}^n |A_i|^\alpha = \|A\|_\alpha^\alpha$. We define the $\alpha$-th Renyi
entropy as $H_\alpha = \log(\|x\|_\alpha^\alpha)/(1 - \alpha)$ and the $\alpha$-th Tsallis entropy as $T_\alpha = (1 - \|x\|_\alpha^\alpha)/(\alpha - 1)$.
Shannon entropy $H = H_1$ is defined by $H = -\sum_{i=1}^n x_i \log x_i$. A straightforward application of
l'Hôpital's rule shows that $H = \lim_{\alpha \to 1} H_\alpha = \lim_{\alpha \to 1} T_\alpha$. It will often be convenient to focus on
the quantity $\alpha - 1$ instead of $\alpha$ itself. Thus, we often write $H(a) = H_{1+a}$ and $T(a) = T_{1+a}$.
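As a numerical illustration of these definitions, the following sketch evaluates the Renyi and Tsallis entropies of a small distribution and shows both approaching Shannon entropy as $\alpha \to 1$. Natural logarithms and the example distribution are assumptions of this sketch, not choices made in the text.

```python
import math

def renyi(x, alpha):
    """H_alpha = log(sum_i x_i^alpha) / (1 - alpha), for alpha != 1."""
    return math.log(sum(p ** alpha for p in x if p > 0)) / (1 - alpha)

def tsallis(x, alpha):
    """T_alpha = (1 - sum_i x_i^alpha) / (alpha - 1), for alpha != 1."""
    return (1 - sum(p ** alpha for p in x if p > 0)) / (alpha - 1)

def shannon(x):
    """H = -sum_i x_i log x_i (natural logarithm)."""
    return -sum(p * math.log(p) for p in x if p > 0)

x = [0.5, 0.25, 0.125, 0.125]
H = shannon(x)
for a in (1e-2, 1e-3, 1e-4):
    # Both gaps shrink roughly linearly in a = alpha - 1.
    print(a, H - renyi(x, 1 + a), H - tsallis(x, 1 + a))
```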

We will often need to approximate frequency moments, for which we use the following: 

Fact 2.1 (Indyk [16], Li [20], [21]). There is an algorithm for multiplicative approximation of
$F_\alpha$ for any $\alpha \in (0, 2]$. The algorithm needs $O(\varepsilon^{-2} \log m)$ bits of space in the general update
model, and $O\left(\left(\frac{|\alpha - 1|}{\varepsilon^2} + \frac{1}{\varepsilon}\right) \log m\right)$ bits of space in the strict turnstile model.

For any function $a \mapsto f(a)$, we denote its $k$-th derivative with respect to $a$ by $f^{(k)}(a)$.



3 Estimating Shannon Entropy 

3.1 Overview 

We begin by describing a general algorithm for computing an additive approximation to Shannon 
entropy. The remainder of this paper describes and analyzes various details and incarnations of 
this algorithm, including extensions to give a multiplicative approximation in Section 3.4. We 
assume that $m$, the length of the stream, is known in advance. Computing $\|A\|_1$ is trivial since
we assume the strict turnstile model at present.

Algorithm 1. Our algorithm for additively approximating empirical Shannon entropy. 

Choose error parameter $\tilde{\varepsilon}$ and points $\{y_0, \ldots, y_k\}$
Process the entire stream:

For each $i$, compute $\tilde{F}_{1+y_i}$, a $(1 + \tilde{\varepsilon})$-approximation of the frequency moment $F_{1+y_i}$
For each $i$, compute $\tilde{H}(y_i) = -\log(\tilde{F}_{1+y_i} / \|A\|_1^{1+y_i}) / y_i$ and $\tilde{T}(y_i) = (1 - \tilde{F}_{1+y_i} / \|A\|_1^{1+y_i}) / y_i$
Return an estimate of $H(0)$ or $T(0)$ by interpolation using the points $\tilde{H}(y_i)$ or $\tilde{T}(y_i)$

3.2 One-point Interpolation 

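The easiest implementation of this algorithm is to set $k = 0$, and estimate Shannon entropy $H$
using a single estimate of Renyi entropy $H(y_0)$. We choose $y_0 = \tilde{\Theta}(\varepsilon/(\log n \log m))$ and $\tilde{\varepsilon} = \varepsilon \cdot y_0$.
By Fact 2.1, the space required is $\tilde{O}(\varepsilon^{-3} \log n \log m)$ words. The following argument shows this
gives an additive $O(\varepsilon)$ approximation. With constant probability $\tilde{F}_{1+y_0} = (1 \pm \tilde{\varepsilon}) F_{1+y_0}$. Then

\[
\tilde{H}(y_0) = -\frac{1}{y_0} \log\left(\frac{\tilde{F}_{1+y_0}}{\|A\|_1^{1+y_0}}\right)
= -\frac{1}{y_0} \log\left((1 \pm O(\tilde{\varepsilon})) \sum_{i=1}^n x_i^{1+y_0}\right)
= H(y_0) \pm O\left(\frac{\tilde{\varepsilon}}{y_0}\right) = H \pm O(\varepsilon). \tag{3.1}
\]

The last equality follows from the following theorem, which bounds the rate of convergence of
Renyi entropy towards Shannon entropy. A proof is given in Appendix A.1.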

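Theorem 3.1. Let $x \in \mathbb{R}^n$ be a probability distribution whose smallest positive value is at least
$1/m$, where $m \ge n$. Let $0 < \varepsilon < 1$ be arbitrary. Define $\mu = \varepsilon/(4 \log m)$, $\nu = \varepsilon/(4 \log n \log m)$,
$\alpha = 1 + \mu/(16 \log(1/\mu))$, and $\beta = 1 + \nu/(16 \log(1/\nu))$. Then

\[
1 \le \frac{H_1}{H_\alpha} \le 1 + \varepsilon \quad \text{and} \quad 0 \le H_1 - H_\beta \le \varepsilon.
\]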
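The one-point scheme can be sketched as follows. This is an illustration under stated assumptions rather than the streaming algorithm itself: the moment $F_{1+y_0}$ is computed exactly as a stand-in for the sketch of Fact 2.1, logarithms are taken base 2, and the function name `one_point_estimate` is hypothetical.

```python
import math

def one_point_estimate(A, eps):
    """Estimate Shannon entropy (in bits) by the Renyi entropy H(y0),
    with y0 = beta - 1 chosen as in Theorem 3.1."""
    n = len(A)
    m = sum(A)  # with unit insertions only, m = ||A||_1
    nu = eps / (4 * math.log2(n) * math.log2(m))
    y0 = nu / (16 * math.log2(1 / nu))
    # Exact moment; a streaming algorithm would use a (1 +- eps*y0)-approximation.
    F = sum(a ** (1 + y0) for a in A if a > 0)
    return -math.log2(F / m ** (1 + y0)) / y0

A = [4, 2, 1, 1]                      # x = (1/2, 1/4, 1/8, 1/8), m = 8
H_true = 0.5 * 1 + 0.25 * 2 + 2 * 0.125 * 3   # 1.75 bits
est = one_point_estimate(A, eps=0.5)
```

Since Renyi entropy is non-increasing in its order, the estimate never exceeds the true entropy when the moment is exact.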

3.3 Multi-point Interpolation 

The algorithm of Section 3.2 is limited by the following tradeoff: if we choose the point yo 
to be close to 0, the accuracy increases, but the space usage also increases. In this section, 
we avoid that tradeoff by interpolating with multiple points. This allows us to obtain good 
accuracy without taking the points too close to 0. We formalize this using approximation 
theory arguments and properties of Chebyshev polynomials. 

The algorithm estimates the Tsallis entropy with error parameter $\tilde{\varepsilon} = \varepsilon/(12(k+1)^3 \log m)$
using points $y_0, y_1, \ldots, y_k$, chosen as follows. First, the number of points is $k = \log(1/\varepsilon) +
\log \log m$. Their values are chosen to be an affine transformation of the extrema of the $k$-th
Chebyshev polynomial. Formally, set $\ell = 1/(2(k+1) \log m)$ and define the map $f : \mathbb{R} \to \mathbb{R}$ by

\[
f(y) = \frac{(k^2 \ell) \cdot y - \ell (k^2 + 1)}{2k^2 + 1}, \quad \text{then define } y_i = f(\cos(i\pi/k)). \tag{3.2}
\]



The correctness of this algorithm is proven in Section 3.3.2. Let us now analyze the space
requirements. Computing the estimate $\tilde{F}_{1+y_i}$ uses only $O(\tilde{\varepsilon}^{-2} / \log m)$ words of space by Fact 2.1
since $|y_i| \le 1/(2(k+1) \log m)$ for each $i$. By our choice of $k = \tilde{O}(1)$ and $\tilde{\varepsilon}$, the total space required
is $\tilde{O}(\varepsilon^{-2} \log m)$ words.
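The multi-point scheme can be sketched numerically. The sketch assumes the affine map of Eq. (3.2) is $f(y) = (k^2 \ell y - \ell(k^2+1))/(2k^2+1)$, which agrees with the endpoint values $f(-1) = -\ell$ and $f(1) = -\ell/(2k^2+1)$ used in Section 3.3.2. It also assumes exact moments in place of sketches, base-2 logarithms for choosing $k$ and $\ell$, and natural logarithms for the entropy; all function names are hypothetical.

```python
import math

def chebyshev_points(k, m):
    """Interpolation points y_i = f(cos(i*pi/k)), all lying in [-ell, 0)."""
    ell = 1.0 / (2 * (k + 1) * math.log2(m))
    f = lambda y: (k * k * ell * y - ell * (k * k + 1)) / (2 * k * k + 1)
    return [f(math.cos(i * math.pi / k)) for i in range(k + 1)]

def tsallis_at(x, y):
    """T(y) = T_{1+y} = (1 - sum_j x_j^(1+y)) / y."""
    return (1 - sum(p ** (1 + y) for p in x if p > 0)) / y

def interpolate_at_zero(ys, vals):
    """Evaluate at 0 the Lagrange polynomial through the points (ys[i], vals[i])."""
    total = 0.0
    for i, (yi, vi) in enumerate(zip(ys, vals)):
        w = 1.0
        for j, yj in enumerate(ys):
            if j != i:
                w *= (0.0 - yj) / (yi - yj)
        total += vi * w
    return total

def estimate_shannon(A, eps):
    m = sum(A)
    x = [a / m for a in A]
    k = math.ceil(math.log2(1 / eps) + math.log2(math.log2(m)))
    ys = chebyshev_points(k, m)
    return interpolate_at_zero(ys, [tsallis_at(x, y) for y in ys])

A = [512, 256, 128, 64, 32, 16, 8, 4, 2, 2]   # m = 1024
H_true = -sum(a / 1024 * math.log(a / 1024) for a in A)
est = estimate_shannon(A, eps=0.1)
```

With exact moments the only error is the interpolation error, which is far below $\varepsilon$ here; the analysis in Section 3.3.2 concerns how approximation errors in the moments propagate.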

We argue correctness of this algorithm in Section 3.3.2. Before doing so, we must mention 
some properties of Chebyshev polynomials. 

3.3.1 Chebyshev Polynomials 

Our algorithm exploits certain extremal properties of Chebyshev polynomials. For a basic 
introduction to Chebyshev polynomials we refer the reader to [24, 25, 28]. A thorough treatment 
of these objects can be found in [29]. We now present the background relevant for our purposes. 

Definition 3.2. The set $\mathcal{P}_k$ consists of all polynomials of degree at most $k$ with real coefficients.
The Chebyshev polynomial of degree $k$, $P_k(x)$, is defined by the recurrence

\[
P_k(x) = \begin{cases} 1, & k = 0 \\ x, & k = 1 \\ 2x P_{k-1}(x) - P_{k-2}(x), & k \ge 2 \end{cases}
\]

and satisfies $|P_k(x)| \le 1$ for all $x \in [-1, 1]$. The value $|P_k(x)|$ equals 1 for exactly $k + 1$ values
of $x$ in $[-1, 1]$; specifically, $P_k(\eta_{j,k}) = (-1)^j$ for $0 \le j \le k$, where $\eta_{j,k} = \cos(j\pi/k)$. The set $\mathcal{C}_k$
is defined as the set of all polynomials $p \in \mathcal{P}_k$ satisfying $\max_{0 \le j \le k} |p(\eta_{j,k})| \le 1$.

Fact 3.3 (Extremal Growth Property). If $p \in \mathcal{C}_k$ and $|t| \ge 1$, then $|p(t)| \le |P_k(t)|$.

Proof. See [29, Ex. 1.5.11] or Rogosinski [30]. ■ 

Fact 3.3 states that all polynomials which are bounded on certain "critical points" of the
interval $I = [-1, 1]$ cannot grow faster than Chebyshev polynomials once leaving $I$.
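The recurrence and the extremal values can be checked directly. The following sketch evaluates $P_k$ by the three-term recurrence, confirms that $P_k(\cos(j\pi/k)) = (-1)^j$, and compares the growth of the monomial $t^k$ (which belongs to $\mathcal{C}_k$) against $P_k$ outside $[-1, 1]$. The function name and the test values are choices made for this illustration.

```python
import math

def chebyshev(k, x):
    """Evaluate the degree-k Chebyshev polynomial P_k(x) by its recurrence."""
    if k == 0:
        return 1.0
    prev, cur = 1.0, x            # P_0(x), P_1(x)
    for _ in range(k - 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur

k = 6
# Values at the k+1 extremal points eta_{j,k} = cos(j*pi/k); these alternate in sign.
extremal_values = [chebyshev(k, math.cos(j * math.pi / k)) for j in range(k + 1)]

# p(t) = t^k is bounded by 1 on the extremal points, so Fact 3.3 applies to it.
t = 1.1
monomial, cheb = t ** k, chebyshev(k, t)
```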

3.3.2 Correctness 

To analyze our algorithm, let us first suppose that our algorithm could exactly compute the
Tsallis entropies $T(y_i)$ for $0 \le i \le k$. Let $p$ be the degree-$k$ polynomial obtained by interpolating
at the chosen points, i.e., $p(y_i) = T(y_i)$ for $0 \le i \le k$. The algorithm uses $p(0)$ as its estimate
for $T(0)$. We analyze the accuracy of this estimate using the following fact. Recall that the
notation $g^{(k)}$ denotes the $k$-th derivative of a function $g$.

Fact 3.4 (Phillips and Taylor [25], Theorem 4.2). Let $y_0, y_1, \ldots, y_k$ be points in the interval
$[a, b]$. Let $g : \mathbb{R} \to \mathbb{R}$ be such that $g^{(1)}, \ldots, g^{(k)}$ exist and are continuous on $[a, b]$, and $g^{(k+1)}$
exists on $(a, b)$. Then, for every $y \in [a, b]$, there exists $\xi_y \in (a, b)$ such that

\[
g(y) - p(y) = \left(\prod_{i=0}^{k} (y - y_i)\right) \frac{g^{(k+1)}(\xi_y)}{(k+1)!},
\]

where $p(y)$ is the degree-$k$ polynomial obtained by interpolating the points $(y_i, g(y_i))$, $0 \le i \le k$.

To apply this fact, a bound on $|T^{(k+1)}(y)|$ is needed. It suffices to consider the interval
$[-\ell, 0)$, since the map $f$ defined in Eq. (3.2) sends $-1 \mapsto -\ell$ and $1 \mapsto -\ell/(2k^2 + 1)$, and hence
Eq. (3.2) shows that $y_i \in [-\ell, 0)$ for all $i$. Since $\ell = 1/(2(k+1) \log m)$, it follows from the
following lemma that

\[
|T^{(k+1)}(\xi)| \le \frac{4 \log^{k+1}(m) \, H}{k + 2} \qquad \forall \xi \in [-\ell, 0). \tag{3.3}
\]



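Lemma 3.5. Let $\varepsilon$ be in $(0, 1/2]$. Then, $\left|T^{(k)}\left(-\frac{\varepsilon}{(k+1) \log m}\right)\right| \le 4 \log^k(m) H/(k + 1)$.

By Fact 3.4 and Eq. (3.3), we have

\[
|T(0) - p(0)| \le |\ell|^{k+1} \cdot \frac{4 \log^{k+1}(m) H}{(k+1)! \, (k+2)}
\le \frac{1}{2^{k+1} \log^{k+1}(m)} \cdot \frac{4 \log^{k+1}(m) H}{(k+1)! \, (k+2)}
\le \frac{\varepsilon H}{2 \log m} \le \frac{\varepsilon}{2}, \tag{3.4}
\]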

since $2^k = (\log m)/\varepsilon$ and $H \le \log m$. This demonstrates that our algorithm computes a good
approximation of $T(0) = H$, under the assumption that the values $T(y_i)$ can be computed
exactly. The remainder of this section explains how to remove this assumption.

Algorithm 1 does not compute the exact values $T(y_i)$; it only computes approximations. The
accuracy of these approximations can be determined as follows. Then

(3.5)

Now recall that $x_j \ge 1/m$ for each $j$ and $y_i \ge -\ell$, so that $x_j^{y_i} \le 2$ and hence
$\sum_{j=1}^n x_j^{1+y_i} \le 2 \sum_{j=1}^n x_j = 2$. Since $\tilde{\varepsilon}/\ell = \varepsilon/(6k^2)$, we have

Thus

\[
T(y_i) \le \tilde{T}(y_i) \le T(y_i) + \varepsilon/(3k^2). \tag{3.6}
\]

Now let $\tilde{p}(x)$ be the degree-$k$ polynomial defined by $\tilde{p}(y_i) = \tilde{T}(y_i)$ for all $0 \le i \le k$. Then
Eq. (3.6) shows that $r(x) = \tilde{p}(x) - p(x)$ is a polynomial of degree at most $k$ satisfying $|r(y_i)| \le
\varepsilon/(3k^2)$ for all $0 \le i \le k$.

Let $P : \mathbb{R} \to \mathbb{R}$ be the Chebyshev polynomial of degree $k$, and let $Q(y) = P(f^{-1}(y))$ be an
affine transformation of $P$. Then the polynomial $r'(y) = (3k^2/\varepsilon) \cdot r(y)$ satisfies $|r'(y_i)| \le |Q(y_i)|$
for all $0 \le i \le k$. Thus Fact 3.3 implies that $|r'(0)| \le |Q(0)|$. By definition of $Q$, $Q(0) =
P(f^{-1}(0)) = P(1 + 1/k^2)$. The following lemma shows that this is at most $e^2$.

Lemma 3.6. Let $P$ be the $k$-th Chebyshev polynomial, $k \ge 1$, and let $x = 1 + k^{-c}$. Then

\[
|P_k(x)| \le \prod_{j=1}^{k} \left(1 + \frac{2j}{k^c}\right) \le e^{2k^{2-c}}.
\]

Thus $|r'(0)| \le e^2$ and $|r(0)| \le \varepsilon/2$ since $k > 2$. To conclude, we have shown $|\tilde{p}(0) - p(0)| =
|r(0)| \le \varepsilon/2$. Combining with Eq. (3.4) via the triangle inequality shows $|\tilde{p}(0) - H| \le \varepsilon$.
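The bound $P_k(1 + 1/k^2) \le e^2$ used above is easy to confirm numerically. The sketch below checks it for a range of $k$ and also checks the arithmetic step $e^2 \cdot \varepsilon/(3k^2) \le \varepsilon/2$, which holds once $k \ge 3$. The range of $k$ tested is an arbitrary choice for this illustration.

```python
import math

def chebyshev(k, x):
    """Evaluate the degree-k Chebyshev polynomial P_k(x) by its recurrence."""
    if k == 0:
        return 1.0
    prev, cur = 1.0, x
    for _ in range(k - 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur

# Q(0) = P_k(1 + 1/k^2) for k = 1, ..., 50.
q0 = {k: chebyshev(k, 1 + 1 / k ** 2) for k in range(1, 51)}
largest = max(q0.values())

# |r(0)| <= e^2 * eps / (3 k^2); the factor multiplying eps is at most 1/2 for k >= 3.
factor_at_3 = math.e ** 2 / (3 * 3 ** 2)
```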

3.4 Multiplicative Approximation of Shannon Entropy 

We now discuss how to extend the multi-point interpolation algorithm to obtain a multiplicative 
approximation of Shannon entropy. The main tool that we require is a multiplicative estimate of 
Tsallis entropy, rather than the additive estimates used above. Section 5 shows that the required 
multiplicative estimates can be efficiently computed; Section 4 provides tools for doing this. 

The modifications to the multi-point interpolation algorithm are as follows. We set the
number of interpolation points to be $k = \max\{5, \log(1/\varepsilon)\}$, then argue as in Eq. (3.4) to
have $|T(0) - p(0)| \le \varepsilon H/2$, where $p$ is the interpolated polynomial of degree $k$. We then
use Algorithm 1, but we compute $\tilde{T}(y_i)$ to be a $(1 + \tilde{\varepsilon})$-multiplicative estimation of $T(y_i)$
instead of an $\tilde{\varepsilon}$-additive estimation by using Theorem 5.6. By arguing as in Eq. (3.6), we have
$T(y_i) \le \tilde{T}(y_i) \le T(y_i) + \varepsilon T(y_i)/(3k^2) \le T(y_i) + 4\varepsilon H/(3k^2)$. The final inequality follows from
Lemma 3.5 with $k = 0$. From this point, the argument is identical to that of Section 3.3.2 and shows
that $|\tilde{p}(0) - p(0)| \le 4\varepsilon e^2 H/(3k^2) \le \varepsilon H/2$, yielding $|\tilde{p}(0) - H| \le \varepsilon H$ by the triangle inequality.

4 Estimating Residual Moments 

To multiplicatively approximate Shannon entropy, the algorithm of Section 3.4 requires a
multiplicative approximation of Tsallis entropy. Section 5 shows that the required quantities can
be computed. The main tool needed is an efficient algorithm for estimating residual moments.
That is the topic of the present section.

Define the residual $\alpha$-th moment to be $F_\alpha^{\mathrm{res}} := \sum_{i=2}^n |A_i|^\alpha = F_\alpha - |A_1|^\alpha$, where we reorder the
items such that $|A_1| \ge |A_2| \ge \cdots \ge |A_n|$. In this section, we present two efficient algorithms to
compute a $1 + \varepsilon$ multiplicative approximation to $F_\alpha^{\mathrm{res}}$ for $\alpha \in (0, 2]$. These algorithms succeed
with constant probability under the assumption that a heavy hitter exists, say $|A_1| \ge \frac{4}{5} \|A\|_1$.
The algorithm of Section 4.2 is valid only in the strict turnstile model. Its space usage has
a complicated dependence on $\alpha$; for the primary range of interest, $\alpha \in [1/3, 1)$, the bound is
$O((\varepsilon^{-1/\alpha} + \varepsilon^{-2}(1 - \alpha) + \log n) \log m)$. The algorithm of Section 4.3 is valid in the general update
model and uses $\tilde{O}(\varepsilon^{-2} \log m)$ bits of space.

4.1 Finding a Heavy Element 

A subroutine that is needed for both of our algorithms is to detect whether a heavy hitter exists
($|A_i| \ge \frac{4}{5} \|A\|_1$) and to find the identity of that element. We will describe a procedure for doing
so in the general update model. We use the following result, which essentially follows from the
count-min sketch [8]. For completeness, a self-contained proof is given in Appendix A.5.

Fact 4.1. Let $w \in \mathbb{R}^n$ be a weight vector on $n$ elements so that $\sum_i w_i = 1$. There exists a
family $\mathcal{H}$ of hash functions mapping the $n$ elements to $O(1/\varepsilon)$ bins with $|\mathcal{H}| = n^{O(1)}$ such that
a random $h \in \mathcal{H}$ satisfies the following two properties with probability at least 15/16.

(1) If $w_i \ge 1/2$ then the weight of elements that collide with element $i$ is at most $\varepsilon \cdot \sum_{j \ne i} w_j$.

(2) If $\max_i w_i < 1/2$ then the weight of elements hashing to each bin is at most 3/4.

We use the hash function from Fact 4.1 with $\varepsilon = 1/10$ to partition the elements into bins,
and for each bin maintain a counter of the net $L_1$ weight of the elements that hash to it. If there
is a heavy hitter, then the net weight in its bin is more than $4/5 - \varepsilon(1/5) > 3/4$. Conversely, if there is a
bin with at least 3/4 of the weight then Fact 4.1 implies that there is a heavy element.

We determine the identity of the heavy element via a group-testing type of argument: we
maintain $\lceil \log_2 n \rceil$ counters, of which the $i$-th counts the number of elements which have their $i$-th
bit set. Thus, if there is a heavy element, we can determine its $i$-th bit by checking whether the
fraction of elements with their $i$-th bit set is at least 3/5.
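The group-testing step can be sketched as follows. This is a minimal illustration that assumes items are labeled $0, \ldots, n-1$ and that a heavy hitter holding at least a 4/5 fraction of the total weight exists; the class name `BitCounters` is hypothetical, and the detection of whether a heavy hitter exists (via Fact 4.1) is not included.

```python
import math

class BitCounters:
    """One counter per bit position, plus the total weight."""

    def __init__(self, n):
        self.bits = max(1, math.ceil(math.log2(n)))
        self.total = 0
        self.count = [0] * self.bits

    def update(self, i, v):
        """Process the update (i, v): add v to every counter whose bit is set in i."""
        self.total += v
        for b in range(self.bits):
            if (i >> b) & 1:
                self.count[b] += v

    def heavy_identity(self):
        """Bit b of the heavy element is 1 iff at least 3/5 of the weight has bit b set."""
        ident = 0
        for b in range(self.bits):
            if 5 * self.count[b] >= 3 * self.total:
                ident |= 1 << b
        return ident

bc = BitCounters(16)
for _ in range(80):
    bc.update(11, +1)              # heavy hitter: 80 of 100 occurrences
for item in (3, 5, 7, 9):
    for _ in range(5):
        bc.update(item, +1)
bc.update(3, +1)
bc.update(3, -1)                   # a deletion that cancels the previous insertion
found = bc.heavy_identity()
```

The structure uses $\lceil \log_2 n \rceil + 1$ counters, each of $O(\log m)$ bits.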

4.2 Bucketing Algorithm 

In this section, we describe an algorithm for estimating $F_\alpha^{\mathrm{res}}$ that works only in the strict turnstile
model. The algorithm has several cases, depending on the value of $\alpha$.

Case 1: $\alpha = 1$. This is the simplest case for our algorithm. We use the hash function from
Fact 4.1 to partition the elements into bins, and for each bin maintain a count of the number
of elements that hash to it. If there is a bin with more than 3/4 of the elements at the end of the
procedure, then there is a heavy element, and it suffices to return the total number of elements
in the other bins. Otherwise, we announce that there is no heavy hitter. The correctness follows
from Fact 4.1, and the space required is $O(\frac{1}{\varepsilon} \log m)$ bits.

Case 2: $\alpha \in (0, \frac{1}{3}) \cup (1, 2]$. Again, we use the hash function from Fact 4.1 to partition the
elements into bins. For each bin, we maintain a count of the number of elements, and a sketch
of the $\alpha$-th moment using Fact 2.1. The counts allow us to detect if there is a heavy hitter, as
in Case 1. If so, we combine the moment sketches of all bins other than the one containing the
heavy hitter; this gives a good estimate with constant probability. By Fact 2.1, we need only

\[
O\left(\frac{1}{\varepsilon} \cdot \left(\frac{|\alpha - 1|}{\varepsilon^2} + \frac{1}{\varepsilon}\right) \log m + \frac{1}{\varepsilon} \log m\right) = O\left(\left(\frac{|\alpha - 1|}{\varepsilon^3} + \frac{1}{\varepsilon^2}\right) \log m\right) \text{ bits.}
\]
Case 3: $\alpha \in [\frac{1}{3}, 1)$. This is the most interesting case. The idea is to keep just one sketch of
the $\alpha$-th moment for the entire stream. At the end, we estimate $F_\alpha^{\mathrm{res}}$ by artificially appending
deletions to the stream which almost entirely remove the heavy hitter from the sketch.

The algorithm computes four quantities in parallel. First, $\tilde{F}_1^{\mathrm{res}} = (1 \pm \varepsilon') F_1^{\mathrm{res}}$ with error
parameter $\varepsilon' = \varepsilon^{1/\alpha}$, using the above algorithm with $\alpha = 1$. Second, $\tilde{F}_\alpha = (1 \pm \varepsilon) F_\alpha$ using
Fact 2.1. Third, $F_1$, which is trivial in the strict turnstile model. Lastly, we determine the
identity of the heavy hitter as in Section 4.1.

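Now we explain how to estimate $F_\alpha^{\mathrm{res}}$. The key observation is that $F_1 - \tilde{F}_1^{\mathrm{res}}$ is a very
good approximation to $A_1$ (assume this is the heavy hitter). So if we delete the heavy hitter
$(F_1 - \tilde{F}_1^{\mathrm{res}})$ times, then there are at most $\varepsilon' F_1^{\mathrm{res}}$ remaining occurrences of it. Define $\hat{F}_\alpha^{\mathrm{res}}$ to
be the value of $F_\alpha$ after processing these deletions. Clearly $F_\alpha^{\mathrm{res}} \ge (F_1^{\mathrm{res}})^\alpha$, by concavity of the
function $y \mapsto y^\alpha$. On the other hand, the remaining occurrences of the heavy hitter contribute
at most $(\varepsilon' F_1^{\mathrm{res}})^\alpha$. Hence, the remaining occurrences of the heavy hitter inflate $F_\alpha^{\mathrm{res}}$ by a factor
of at most $1 + (\varepsilon' F_1^{\mathrm{res}})^\alpha / (F_1^{\mathrm{res}})^\alpha = 1 + \varepsilon$. Thus $\hat{F}_\alpha^{\mathrm{res}} = (1 + O(\varepsilon)) F_\alpha^{\mathrm{res}}$, as desired. The number
of bits of space used by this algorithm is at most

\[
O\left(\frac{1}{\varepsilon'} \log m + \left(\frac{|\alpha - 1|}{\varepsilon^2} + \frac{1}{\varepsilon}\right) \log m + \log n \log m\right) = O\left(\left(\frac{1}{\varepsilon^{1/\alpha}} + \frac{1 - \alpha}{\varepsilon^2} + \log n\right) \log m\right).
\]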
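The arithmetic behind the deletion trick can be illustrated with exact quantities. The sketch below assumes the worst case in which the estimate of $F_1^{\mathrm{res}}$ is too large by exactly a factor $1 + \varepsilon'$, so that $\varepsilon' F_1^{\mathrm{res}}$ occurrences of the heavy hitter survive the deletions; moments are computed exactly rather than by a sketch, and the function name is hypothetical.

```python
def residual_moment_by_deletion(A, heavy, alpha, eps):
    """Value of F_alpha after deleting the heavy hitter (F_1 - F1_res_est) times."""
    F1 = sum(A.values())
    F1_res = F1 - A[heavy]
    eps_prime = eps ** (1 / alpha)
    F1_res_est = (1 + eps_prime) * F1_res       # simulated overestimate
    deletions = F1 - F1_res_est                 # roughly A[heavy] many deletions
    remaining = A[heavy] - deletions            # equals eps_prime * F1_res
    rest = sum(a ** alpha for i, a in A.items() if i != heavy)
    return rest + remaining ** alpha

A = {0: 800, 1: 50, 2: 50, 3: 50, 4: 50}        # item 0 holds 4/5 of the weight
alpha, eps = 0.5, 0.1
true_res = sum(a ** alpha for i, a in A.items() if i != 0)
est = residual_moment_by_deletion(A, 0, alpha, eps)
```

In this instance the leftover occurrences inflate the residual moment by about 5 percent, within the $1 + \varepsilon$ factor derived above.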

4.3 Geometric Mean Algorithm 

This section describes an algorithm for estimating $F_\alpha^{\mathrm{res}}$ in the general update model. At a high
level, the algorithm uses a hash function to partition the stream elements into two substreams,
then separately estimates the moment $F_\alpha$ for the substreams. The estimate for the substream
which does not contain the heavy hitter yields a good estimate of $F_\alpha^{\mathrm{res}}$. We improve accuracy of
this estimator by averaging many independent trials. Detailed description and analysis follow.
We use Li's geometric mean estimator [21] for estimating $F_\alpha$ since it is unbiased (its being
unbiased will be useful later). The geometric mean estimator is defined as follows. Let $k$ and
$\alpha$ be parameters. We let $y = R \cdot A$, where $A$ is the vector representing the stream and $R$ is a
$k \times n$ matrix whose entries are i.i.d. samples from an $\alpha$-stable distribution. Define

\[
\tilde{F}_\alpha = \frac{\prod_{j=1}^{k} |y_j|^{\alpha/k}}{\left[\frac{2}{\pi} \Gamma\left(\frac{\alpha}{k}\right) \Gamma\left(1 - \frac{1}{k}\right) \sin\left(\frac{\pi \alpha}{2k}\right)\right]^{k}}.
\]

The space required to compute this estimator is easily seen to be $O(k \cdot \log m)$ bits. Li analyzed
the variance of $\tilde{F}_\alpha$ as $k \to \infty$; however, for our purposes we are only interested in the case $k = 3$
and henceforth restrict to only this case (one can show $\tilde{F}_\alpha$ has unbounded variance for $k < 3$).
Building on Li's analysis, we show the following result.

Lemma 4.2. There exists an absolute constant $C_{GM}$ such that $\mathrm{Var}[\tilde{F}_\alpha] \le C_{GM} \cdot \mathrm{E}[\tilde{F}_\alpha]^2$.



Let $r$ denote the number of independent trials. For each $j \in [r]$, the algorithm picks a function
$h_j : [n] \to \{0, 1\}$ uniformly at random. For $j \in [r]$ and $l \in \{0, 1\}$, define $F_{\alpha,j,l} = \sum_{i : h_j(i) = l} |A_i|^\alpha$.
This is the $\alpha$-th moment for the $l$-th substream during the $j$-th trial.

For each $j$ and $l$, our algorithm computes an estimate $\tilde{F}_{\alpha,j,l}$ of $F_{\alpha,j,l}$ using the geometric
mean estimator. We also run in parallel the algorithm of Section 4.1 to discover which $i \in [n]$ is
the heavy hitter; henceforth assume $i = 1$. Our overall estimate for $F_\alpha^{\mathrm{res}}$ is then

\[
\tilde{F}_\alpha^{\mathrm{res}} = \frac{2}{r} \sum_{j=1}^{r} \tilde{F}_{\alpha,j,1-h_j(1)}.
\]

The space used by our algorithm is simply the space required for $r$ geometric mean estimators
and the one heavy hitter algorithm. The latter uses $O(\varepsilon^{-1} \log n)$ bits of space [8, Theorem 7].
Thus the total space required is $O(r \log m + \varepsilon^{-1} \log n)$ bits.

We now sketch an analysis of the algorithm; a formal argument is given in Appendix A.4.
The natural analysis would be to show that, for each item, the fraction of trials in which the
item doesn't collide with the heavy hitter is concentrated around 1/2. A union bound over all
items would require choosing the number of trials to be $\Omega(\frac{1}{\varepsilon^2} \log n)$. We obtain a significantly
smaller number of trials by using a different analysis. Instead of using a concentration bound
for each item, we observe that items with roughly the same weight (i.e., the value of $|A_i|$) are
essentially equivalent for the purposes of this analysis. So we partition the items into classes
such that all items in the same class have the same weight, up to a $(1 + \varepsilon)$ factor. We then apply
concentration bounds for each class, rather than separately for each item. The number of classes
is only $R = O(\frac{1}{\varepsilon} \log m)$, and a union bound over classes only requires $O(\frac{1}{\varepsilon^2} \log R)$ trials.

As argued, the space usage of this algorithm is $O(r \log m + \varepsilon^{-1} \log n) = \tilde{O}(\varepsilon^{-2} \log m)$ bits.
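The averaging step can be sketched with exact substream moments standing in for the geometric mean estimator. Each non-heavy item falls in the substream that avoids the heavy hitter with probability 1/2, so doubling the average over trials gives an unbiased estimate of the residual moment. The function name, the seed, and the example vector are assumptions of this illustration.

```python
import random

def residual_estimate(A, heavy, alpha, r, seed=0):
    """Average, over r random two-way partitions, twice the alpha-th moment
    of the substream that does not contain the heavy hitter."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(r):
        h = {i: rng.randrange(2) for i in A}    # h_j : [n] -> {0, 1}
        other = 1 - h[heavy]                    # substream avoiding the heavy hitter
        total += sum(abs(a) ** alpha for i, a in A.items() if h[i] == other)
    return 2.0 * total / r

A = {0: 4000}
A.update({i: 1 for i in range(1, 1001)})        # heavy hitter plus 1000 light items
true_res = sum(abs(a) ** 1.5 for i, a in A.items() if i != 0)
est = residual_estimate(A, heavy=0, alpha=1.5, r=200)
```

The heavy hitter never contributes to the estimate, which is what makes the approach usable in the general update model.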

5 Estimation of Renyi and Tsallis Entropy 

This section summarizes our algorithms for estimating Renyi and Tsallis entropy. These
algorithms are used as subroutines for estimating Shannon entropy in Section 3, and may be of
independent interest.

The techniques we use for both the entropies are almost identical. In particular, to compute 
additive approximation of T a or H a , it suffices to compute a sufficiently precise multiplicative 
approximation of the a-th moment. Due to space constraints, we present proofs of all lemmas 
and theorems from this section in the appendix. 

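Theorem 5.1. There is an algorithm that computes an additive $\varepsilon$-approximation of Renyi
entropy in $O\left(\frac{\log m}{|1 - \alpha| \cdot \varepsilon^2}\right)$ bits of space for any $\alpha \in (0, 1) \cup (1, 2]$.

Theorem 5.2. There is an algorithm for additive approximation of Tsallis entropy $T_\alpha$ using

• $O\left(\frac{n^{2(1-\alpha)} \log m}{(1 - \alpha) \varepsilon^2}\right)$ bits, for $\alpha \in (0, 1)$.

• $O\left(\frac{\log m}{(\alpha - 1) \varepsilon^2}\right)$ bits, for $\alpha \in (1, 2]$.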



In order to obtain a multiplicative approximation of Tsallis and Renyi entropy, we must 
prove a few facts. The next lemma says that if there is no heavy element in the empirical 
distribution, then Tsallis entropy is at least a constant. 



Lemma 5.3. Let $x_1, x_2, \ldots, x_n$ be values in $[0, 1]$ of total sum 1. There exists a positive
constant $C$ such that if $x_i \le 5/6$ for all $i$ then, for $\alpha \in (0, 1) \cup (1, 2]$,

\[
\left|1 - \sum_{i=1}^{n} x_i^\alpha\right| \ge C \cdot |\alpha - 1|.
\]

Corollary 5.4. There exists a constant $C$ such that if the probability of each element is at
most 5/6, then the Tsallis entropy is at least $C$ for any $\alpha \in (0, 1) \cup (1, 2]$.

Proof. We have

\[
T_\alpha = \frac{1 - \sum_{i=1}^{n} x_i^\alpha}{\alpha - 1} = \frac{\left|1 - \sum_{i=1}^{n} x_i^\alpha\right|}{|\alpha - 1|} \ge C.
\]

■

We now show how to deal with the case when there is an element of large probability. It 
turns out that in this case we can obtain a multiplicative approximation of Tsallis entropy by 
combining two residual moments. 

Lemma 5.5. There is a positive constant $C$ such that if there is an element $i$ of probability
$x_i \ge 2/3$, then the sum of a multiplicative $(1 + C \cdot |1 - \alpha| \cdot \varepsilon)$-approximation to $1 - x_i$ and
a multiplicative $(1 + C \cdot |1 - \alpha| \cdot \varepsilon)$-approximation to $\sum_{j \ne i} x_j^\alpha$ gives a multiplicative $(1 + \varepsilon)$-
approximation to $|1 - \sum_i x_i^\alpha|$, for any $\alpha \in (0, 1) \cup (1, 2]$.
We collect these facts in the following theorem.

Theorem 5.6. There is a streaming algorithm for multiplicative $(1 + \varepsilon)$-approximation of
Tsallis entropy for any $\alpha \in (0, 1) \cup (1, 2]$ using $O(\log m/(|1 - \alpha| \varepsilon^2))$ bits of space.

The next lemma shows that we can handle the logarithm that appears in the definition of
Renyi entropy.

Lemma 5.7. It suffices to have a multiplicative $(1 + \varepsilon)$-approximation to $t - 1$, where $t \in
(4/9, \infty)$, to compute a multiplicative $(1 + C \cdot \varepsilon)$ approximation to $\log(t)$, for some constant $C$.

We now have all necessary facts to estimate Renyi entropy for $\alpha \in (0, 2]$.

Theorem 5.8. There is a streaming algorithm for multiplicative $(1 + \varepsilon)$-approximation of
Renyi entropy for any $\alpha \in (0, 1) \cup (1, 2]$. The algorithm uses $O(\log m/(|1 - \alpha| \varepsilon^2))$ bits of space.

In fact, Theorem 5.8 is tight in the sense that $(1 + \varepsilon)$-multiplicative approximation of $H_\alpha$ for
$\alpha > 2$ requires polynomial space, as seen in the following theorem.

Theorem 5.9. For any $\alpha > 2$, any randomized one-pass streaming algorithm which $(1 + \varepsilon)$-
approximates $H_\alpha(X)$ requires $\Omega(n^{1 - 2/\alpha - 2\varepsilon - \gamma(\varepsilon + 1/\alpha)})$ bits of space for arbitrary constant $\gamma > 0$.

Tsallis entropy can be efficiently approximated both multiplicatively and additively also for
$\alpha > 2$, but we omit a proof of that fact in this version of the paper.

6 Modifications for General Update Streams 

The algorithms described in Section 3 and Section 5 are for the strict turnstile model. They can 
be extended to work in the general updates model with a few modifications. 

First, we cannot efficiently and exactly compute $\|A\|_1 = F_1$ in the general update model. However, a $(1+\varepsilon)$-multiplicative approximation can be computed in $O(\varepsilon^{-2} \log m)$ bits of space by Fact 2.1. In Section 3.2 and Section 3.3, the value of $\|A\|_1$ is used as a normalization factor to scale the estimate of $F_\alpha$ to an estimate of $\sum_{i=1}^n x_i^\alpha$. (See, e.g., Eq. (3.1) and Eq. (3.5).) However,

$$\frac{\tilde F_\alpha}{(\tilde F_1)^\alpha} = \frac{(1 \pm \varepsilon) \cdot F_\alpha}{\left((1 \pm \varepsilon) \cdot F_1\right)^\alpha} = (1 \pm O(\varepsilon)) \cdot \frac{F_\alpha}{F_1^\alpha},$$

so the fact that $F_1$ can only be approximated in the general update model affects the analysis only by increasing the constant factor that multiplies $\varepsilon$. A similar modification must also be applied to all algorithms in Section 5; we omit the details.
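
The claim that an approximate normalization factor only changes the constant in front of $\varepsilon$ can be checked numerically. The sketch below is a hypothetical illustration, not one of the streaming algorithms: it scales the exact moments by $(1 \pm \varepsilon)$ by hand and reports the worst relative error of the normalized ratio. The names `normalized_moment` and `worst_relative_error` are assumptions made for this example.

```python
def normalized_moment(freqs, alpha, f1_scale=1.0, fa_scale=1.0):
    """Return (fa_scale * F_alpha) / (f1_scale * F_1)^alpha for a frequency vector."""
    f1 = f1_scale * sum(abs(a) for a in freqs)
    fa = fa_scale * sum(abs(a) ** alpha for a in freqs)
    return fa / f1 ** alpha

def worst_relative_error(freqs, alpha, eps):
    """Worst relative error over the four sign choices of the two (1 +/- eps) factors."""
    exact = normalized_moment(freqs, alpha)
    worst = 0.0
    for s1 in (1.0 - eps, 1.0 + eps):
        for sa in (1.0 - eps, 1.0 + eps):
            approx = normalized_moment(freqs, alpha, f1_scale=s1, fa_scale=sa)
            worst = max(worst, abs(approx / exact - 1.0))
    return worst

freqs = [40, 25, 20, 10, 5]   # illustrative frequency vector
for alpha in (0.5, 1.5, 2.0):
    print(alpha, worst_relative_error(freqs, alpha, eps=0.01))
```

For $\alpha \le 2$ and small $\varepsilon$ the reported error stays within a small constant multiple of $\varepsilon$, in line with the $(1 \pm O(\varepsilon))$ factor in the display.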

Next, the multiplicative algorithm of Section 3.4 needs to compute a multiplicative estimate of $T(y_i)$ using Theorem 5.6. In the general updates model, a weaker result than Theorem 5.6 holds: we obtain a multiplicative $(1+\varepsilon)$-approximation of Tsallis entropy for any $\alpha \in (0,1) \cup (1,2]$ using $O(\log m/(|1-\alpha| \cdot \varepsilon)^2)$ bits of space. The proof is identical to the argument in Appendix A.6, except that the moment estimator of Fact 2.1 uses more space, and we must use the residual moment algorithm of Section 4.3 instead of Section 4.2. Similar modifications must be made to Theorem 5.1, Theorem 5.2 and Theorem 5.8, with a commensurate increase in the space bounds.

7 Future Research 

We hope that the techniques from approximation theory that we introduce may be useful for streaming and sketching other functions. For instance, consider the function $G_{\alpha,k}(x) = \sum_i x_i^\alpha (\log x_i)^k$, where $k \in \mathbb{N}$ and $\alpha \in [0, \infty)$. One can show that

$$\lim_{\beta \to \alpha} \frac{G_{\alpha,k}(x) - G_{\beta,k}(x)}{\alpha - \beta} = G_{\alpha,k+1}(x).$$

Note that $G_{\alpha,0}(x)$ is the $\alpha$th moment of $x$, and one can attempt to estimate $G_{\alpha,k+1}$ by computing $G_{\beta,k}$ for $\beta = \alpha$ and $\beta$ close to $\alpha$. It is plausible that our techniques can be generalized to estimation of functions $G_{\alpha,k}$ for $\alpha \in (0,2]$. Can one also use our techniques for approximation of other classes of functions?
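
The limit relation above can be illustrated with a finite difference. The sketch below assumes the reading $G_{\alpha,k}(x) = \sum_i x_i^\alpha (\log x_i)^k$ with natural logarithms; the helper names and the step size `h` are choices made for this example, and the computation is exact arithmetic on a small vector rather than a streaming estimate.

```python
import math

def G(x, alpha, k):
    """G_{alpha,k}(x) = sum_i x_i^alpha * (log x_i)^k over the positive entries of x."""
    return sum(p ** alpha * math.log(p) ** k for p in x if p > 0)

def difference_quotient(x, alpha, k, h=1e-6):
    """Forward difference (G_{alpha+h,k} - G_{alpha,k}) / h, which approximates G_{alpha,k+1}."""
    return (G(x, alpha + h, k) - G(x, alpha, k)) / h

x = [0.5, 0.25, 0.125, 0.125]   # illustrative distribution
alpha = 1.3
for k in range(3):
    print(k, difference_quotient(x, alpha, k), G(x, alpha, k + 1))
```

Extrapolating from a few nearby values of $\beta$, rather than a single difference quotient, is what the approximation-theory arguments of the paper make precise.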
A Proofs

A.1 Proofs from Section 3.2

Recall that $x \in \mathbb{R}^n$ is a distribution whose smallest positive value is at least $1/m$. The key technical lemma needed is as follows.

Lemma A.1. Let $\alpha > 1$, let $\xi = \xi(\alpha)$ denote $4(\alpha - 1)H_1(x)$, and let

$$e(\alpha) = 2\left(\xi \log n + \xi \log(1/\xi)\right).$$

Assume that $\xi(\alpha) < 1/4$. Then $H_\alpha \le H_1 \le H_\alpha + e(\alpha)$.

We require the following basic results. 
Claim A.2. The following inequalities follow from convexity.

• Let $0 \le y \le 1$. Then $e^y \le 1 + 2y$.

• Let $y > 0$. Then $1 - y \le \log(1/y)$.

• Let $0 \le y \le 1/2$. Then $1/(1 - y) \le 1 + 2y$.

Claim A.3. Let $1 \le a \le b$ and let $x \in \mathbb{R}^n$. Then $\|x\|_b \le \|x\|_a \le n^{1/a - 1/b} \|x\|_b$.

Claim A.4. If $0 \le \alpha \le \beta$ then $H_\alpha \ge H_\beta$.

Claim A.5. If $\alpha > 1$ then $\log(1/\|x\|_\alpha) \le (\alpha - 1) \cdot H_1$.

Proof. $\log(1/\|x\|_\alpha) = \frac{\alpha - 1}{\alpha} H_\alpha(x) \le (\alpha - 1) \cdot H_\alpha(x) \le (\alpha - 1) \cdot H_1(x)$. ■

Claim A.6. Let $y = (y_1, \ldots, y_n)$ and $z = (z_1, \ldots, z_n)$ be probability distributions such that $\|y - z\|_1 \le 1/2$. Then

$$|H_1(y) - H_1(z)| \le \|y - z\|_1 \cdot \log\left(\frac{n}{\|y - z\|_1}\right).$$

Proof. See Cover and Thomas [9, 16.3.2]. ■

Proof (of Lemma A.1). The first inequality follows from Claim A.4 so we focus on the second one. Define $f(\alpha) = \log \|x\|_\alpha^\alpha$ and $g(\alpha) = 1 - \alpha$, so that $H_\alpha = f(\alpha)/g(\alpha)$. The derivatives are

$$f'(\alpha) = \frac{\sum_{i=1}^n x_i^\alpha \log x_i}{\|x\|_\alpha^\alpha} \quad\text{and}\quad g'(\alpha) = -1,$$

so $\lim_{\alpha \to 1} f'(\alpha)/g'(\alpha)$ exists and equals $H(x)$. Since $\lim_{\alpha \to 1} f(\alpha) = \lim_{\alpha \to 1} g(\alpha) = 0$, l'Hôpital's rule implies that $\lim_{\alpha \to 1} H_\alpha = H(x)$. A stronger version of l'Hôpital's rule is as follows.

Claim A.7. Let $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ be differentiable functions such that the following limits exist:

$$\lim_{\alpha \to 1} f(\alpha) = 0, \quad \lim_{\alpha \to 1} g(\alpha) = 0, \quad\text{and}\quad \lim_{\alpha \to 1} f'(\alpha)/g'(\alpha) = L.$$

Let $\varepsilon$ and $\delta$ be such that $|\alpha - 1| < \delta$ implies that $|f'(\alpha)/g'(\alpha) - L| < \varepsilon$. Then $|\alpha - 1| < \delta$ also implies that $|f(\alpha)/g(\alpha) - L| < \varepsilon$.

Proof. See Rudin [31, p.109]. □

Thus, to prove our lemma, it suffices to show that $|f'(\alpha)/g'(\alpha) - H_1| \le e(\alpha)$. (In fact, we actually need $|f'(\beta)/g'(\beta) - H_1| \le e(\alpha)$ for all $\beta \in (1, \alpha]$, but this follows by monotonicity of $e(\beta)$ for $\beta \in (1, \alpha]$.)

A key concept in this proof is the "perturbed" probability distribution $x(\alpha)$, defined by $x(\alpha)_i = x_i^\alpha / \|x\|_\alpha^\alpha$. We have the following relationship.



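$$\begin{aligned}
\frac{f'(\alpha)}{g'(\alpha)} &= \frac{1}{\|x\|_\alpha^\alpha} \sum_{i=1}^n x_i^\alpha \left( \log(1/x_i) + \log\|x\|_\alpha - \log\|x\|_\alpha \right) \\
&= \frac{\sum_{i=1}^n x_i^\alpha \log(\|x\|_\alpha / x_i) - \left(\sum_{i=1}^n x_i^\alpha\right) \log\|x\|_\alpha}{\|x\|_\alpha^\alpha} \\
&= \frac{1}{\alpha} \sum_{i=1}^n \frac{x_i^\alpha}{\|x\|_\alpha^\alpha} \log\left(\frac{\|x\|_\alpha^\alpha}{x_i^\alpha}\right) - \log\|x\|_\alpha \\
&= \frac{H_1(x(\alpha))}{\alpha} + \log(1/\|x\|_\alpha).
\end{aligned}$$

In summary, we have shown that

$$\left| \frac{f'(\alpha)}{g'(\alpha)} - \frac{H_1(x(\alpha))}{\alpha} \right| = \log(1/\|x\|_\alpha) \le (\alpha - 1) \cdot H_1(x),$$

the last inequality following from Claim A.5. To use this bound, we observe that:

$$\left| \frac{f'(\alpha)}{g'(\alpha)} - H_1(x(\alpha)) \right| \le \left| \frac{f'(\alpha)}{g'(\alpha)} - \frac{H_1(x(\alpha))}{\alpha} \right| + |1/\alpha - 1| \cdot H_1(x(\alpha)). \qquad \text{(A.1)}$$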



We now combine Eq. (A.1) with the preceding bound, and use $|1/\alpha - 1| \le \alpha - 1$ (valid since $\alpha \ge 1$). This yields:

$$\left| \frac{f'(\alpha)}{g'(\alpha)} - H_1(x(\alpha)) \right| \le (\alpha - 1) \cdot H_1(x) + (\alpha - 1) \cdot H_1(x(\alpha)). \qquad \text{(A.2)}$$

Recall that our goal is to analyze $|f'(\alpha)/g'(\alpha) - H_1(x)|$. We do this by showing that $H_1(x(\alpha)) \approx H_1(x)$, and that the right-hand side of Eq. (A.2) is at most $e(\alpha)$. This is done using Claim A.6; the key step is bounding $\|x - x(\alpha)\|_1$.

Claim A.8. Suppose that $1 < \alpha < 1 + 1/(2 \log n)$. Then $1/\|x\|_\alpha^\alpha \le 1 + 3(\alpha - 1) \cdot H_1(x)$.

Proof. From Claim A.3 and $\|x\|_1 = 1$, we obtain $1/\|x\|_\alpha \le n^{1 - 1/\alpha} \le n^{\alpha - 1}$. Our hypothesis on $\alpha$ implies that

$$\alpha \cdot \log(1/\|x\|_\alpha) \le \alpha \cdot (\alpha - 1) \log n \le 2 \cdot (\alpha - 1) \log n \le 1. \qquad \text{(A.3)}$$

Thus

$$\frac{1}{\|x\|_\alpha^\alpha} = e^{\alpha \log(1/\|x\|_\alpha)} \le 1 + 2 \cdot \alpha \log(1/\|x\|_\alpha) \le 1 + 3(\alpha - 1) H_1(x).$$

The first inequality is from Claim A.2 and Eq. (A.3), and the second from Claim A.5. □

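Recall that $\xi = 4(\alpha - 1)H_1(x)$.

Claim A.9. $\|x - x(\alpha)\|_1 \le \xi$.

Proof. To avoid the absolute values, we split the sum defining $\|x - x(\alpha)\|_1$ into two cases. For that purpose, let $S = \{ i : x(\alpha)_i \ge x_i \}$. Then

$$\begin{aligned}
\|x - x(\alpha)\|_1 &= \sum_{i \in S} \left( x(\alpha)_i - x_i \right) + \sum_{i \notin S} \left( x_i - x(\alpha)_i \right) \\
&= \sum_{i \in S} x_i \cdot \left( \frac{x_i^{\alpha-1}}{\|x\|_\alpha^\alpha} - 1 \right) + \sum_{i \notin S} x_i \cdot \left( 1 - \frac{x_i^{\alpha-1}}{\|x\|_\alpha^\alpha} \right).
\end{aligned}$$

The first sum is upper-bounded using $x_i^{\alpha-1} \le 1$ and $\sum_{i \in S} x_i \le 1$. The second sum is upper-bounded using $\|x\|_\alpha^\alpha \le 1$ and $1 - x_i^{\alpha-1} \le \log(1/x_i^{\alpha-1})$ (see Claim A.2). Hence

$$\begin{aligned}
\|x - x(\alpha)\|_1 &\le \left( \frac{1}{\|x\|_\alpha^\alpha} - 1 \right) + \sum_{i \notin S} x_i \log(1/x_i^{\alpha-1}) \\
&\le 3(\alpha - 1) H_1(x) + (\alpha - 1) \cdot H_1(x),
\end{aligned}$$

using Claim A.8. This completes the proof. □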
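
Claim A.9 can be sanity-checked numerically on a small example. The sketch below assumes natural logarithms and the definition $x(\alpha)_i = x_i^\alpha / \|x\|_\alpha^\alpha$; the helper names and the test distribution are illustrative, and the values of $\alpha$ are kept below $1 + 1/(2 \log n)$, the range required by Claim A.8.

```python
import math

def shannon_entropy(x):
    """H_1(x) in nats."""
    return -sum(p * math.log(p) for p in x if p > 0)

def perturbed(x, alpha):
    """The perturbed distribution x(alpha)_i = x_i^alpha / ||x||_alpha^alpha."""
    norm = sum(p ** alpha for p in x)
    return [p ** alpha / norm for p in x]

def l1_distance(x, y):
    return sum(abs(p - q) for p, q in zip(x, y))

def claim_a9_sides(x, alpha):
    """Return (||x - x(alpha)||_1, xi) where xi = 4 (alpha - 1) H_1(x)."""
    xi = 4.0 * (alpha - 1.0) * shannon_entropy(x)
    return l1_distance(x, perturbed(x, alpha)), xi

x = [0.5, 0.25, 0.125, 0.125]
alpha_max = 1.0 + 1.0 / (2.0 * math.log(len(x)))   # hypothesis of Claim A.8
for alpha in (1.001, 1.01, 1.1, 1.3):
    assert alpha < alpha_max
    print(alpha, claim_a9_sides(x, alpha))
```

On this example the distance is well below $\xi$, which suggests the constant 4 in the claim is not tight for typical inputs.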

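Thus, by our assumption that $\xi(\alpha) < 1/4$, by Claim A.6, by Claim A.9, and by the fact that $x \mapsto x \log(1/x)$ is monotonically increasing for $x \in (0, 1/4)$, we obtain that

$$|H_1(x) - H_1(x(\alpha))| \le \xi \log n + \xi \log(1/\xi).$$

Now we assemble the error bounds. Our result from Eq. (A.2) yields

$$\begin{aligned}
\left| \frac{f'(\alpha)}{g'(\alpha)} - H_1(x) \right| &\le \left| \frac{f'(\alpha)}{g'(\alpha)} - H_1(x(\alpha)) \right| + |H_1(x) - H_1(x(\alpha))| \\
&\le \left[ (\alpha - 1) H_1(x) + (\alpha - 1) H_1(x(\alpha)) \right] + |H_1(x) - H_1(x(\alpha))| \\
&\le 2(\alpha - 1) H_1(x) + \alpha \cdot |H_1(x) - H_1(x(\alpha))| \\
&\le 2\left( \xi \log n + \xi \log(1/\xi) \right).
\end{aligned}$$

This completes the proof. ■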
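
The two-sided bound of Lemma A.1 can be observed directly on a small distribution. This is a minimal sketch under the assumption that all logarithms are natural and that $n$ is the length of the vector; `error_bound` implements $e(\alpha) = 2(\xi \log n + \xi \log(1/\xi))$ and refuses inputs that violate $\xi < 1/4$. All names are hypothetical.

```python
import math

def shannon_entropy(x):
    """H_1(x) in nats."""
    return -sum(p * math.log(p) for p in x if p > 0)

def renyi_entropy(x, alpha):
    """H_alpha(x) = log(sum_i x_i^alpha) / (1 - alpha)."""
    return math.log(sum(p ** alpha for p in x if p > 0)) / (1.0 - alpha)

def error_bound(x, alpha):
    """e(alpha) from Lemma A.1; requires 0 < xi < 1/4 with xi = 4 (alpha - 1) H_1(x)."""
    xi = 4.0 * (alpha - 1.0) * shannon_entropy(x)
    if not 0.0 < xi < 0.25:
        raise ValueError("Lemma A.1 needs 0 < xi < 1/4")
    return 2.0 * (xi * math.log(len(x)) + xi * math.log(1.0 / xi))

x = [0.5, 0.25, 0.125, 0.125]
h1 = shannon_entropy(x)
for alpha in (1.0001, 1.001, 1.01):
    ha = renyi_entropy(x, alpha)
    print(alpha, ha, h1, ha + error_bound(x, alpha))
```

The printed triples show $H_\alpha$, $H_1$, and the upper bound $H_\alpha + e(\alpha)$ for each $\alpha$.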

We now use Lemma A.1 to show that $H_\alpha \approx H_1$ if $\alpha$ is sufficiently close to 1.

Proof (of Theorem 3.1). First we focus on the multiplicative approximation. The lower bound is immediate from Claim A.4, so we show the upper bound. For an arbitrary $\mu \in (0, 1)$, we have

$$\mu^2 \le \frac{\mu}{2 \log(1/\mu)};$$

this follows since $\mu \log(1/\mu) \le 1/2$ for all $\mu$. Let $\tilde\mu = \mu/(2 \log(1/\mu))$. Then

$$\tilde\mu \log(1/\tilde\mu) \le \mu.$$

This follows since $\mu^2 \le \tilde\mu \Rightarrow 1/\tilde\mu \le 1/\mu^2 \Rightarrow \log(1/\tilde\mu) \le 2 \log(1/\mu)$.
The hypotheses of Theorem 3.1 give $\alpha = 1 + \tilde\mu/8$. Hence,

$$\begin{aligned}
e(\alpha) &= 8(\alpha - 1) H_1 \left( \log n + \log\left( 1/(4(\alpha - 1) H_1) \right) \right) \\
&\le \tilde\mu H_1 \left( \log n + \log\left( 2/(\tilde\mu H_1) \right) \right).
\end{aligned}$$

Since $H_1 \ge (\log m)/m$ for any distribution satisfying our hypotheses, this is at most

$$\begin{aligned}
&\le \tilde\mu H_1 \left( \log n + \log(1/\tilde\mu) + \log m \right) \\
&\le (\log m) \mu H_1 \le (\varepsilon/2) H_1,
\end{aligned}$$

since our hypotheses give $\mu = \varepsilon/(4 \log m)$. Applying Lemma A.1, we obtain that

$$H_1 - H_\alpha \le (\varepsilon/2) H_1 \;\Longrightarrow\; (1 - \varepsilon/2) H_1 \le H_\alpha \;\Longrightarrow\; \frac{H_1}{H_\alpha} \le \frac{1}{1 - \varepsilon/2} \le 1 + \varepsilon,$$

the last inequality following from Claim A.2. This establishes the multiplicative approximation.
Let us now consider the above argument, replacing $\mu$ with $\nu = \varepsilon/(4 \log n \log m)$. We obtain

$$e(\alpha) \le (\log m) \nu H_1 \le \varepsilon/4,$$

since $H_1 \le \log n$. Thus, the additive approximation follows directly. ■

A.2 Proofs from Section 3.3

Our first task is to prove Lemma 3.5. We require a definition and two preliminary technical results. For any integer $k \ge 0$ and real number $a > -1$, define

$$G_k(a) = \sum_{i=1}^n x_i^{1+a} \log^k(x_i),$$

so $G_0(a) = F_{1+a}/\|A\|_1^{1+a}$. Note that $G_k^{(1)}(a) = G_{k+1}(a)$ for $k \ge 0$, and $T(a) = (1 - G_0(a))/a$.

Claim A.10. The $k$th derivative of the Tsallis entropy has the following expression.

$$T^{(k)}(a) = \frac{(-1)^k k! \left( 1 - G_0(a) \right)}{a^{k+1}} - \left( \sum_{j=1}^{k} \frac{(-1)^{k-j} k! \, G_j(a)}{a^{k-j+1} j!} \right)$$
Proof. The proof is by induction, the case $k = 0$ being trivial. So assume $k \ge 1$. Taking the derivative of the expression for $T^{(k)}(a)$ above, we obtain:

$$\begin{aligned}
T^{(k+1)}(a) &= -\left( \sum_{j=1}^{k} \left( \frac{k! \, (k - j + 1) (-1)^{(k+1)-j} G_j(a)}{a^{(k+1)-j+1} j!} + \frac{k! \, (-1)^{k-j} G_{j+1}(a)}{a^{k-j+1} j!} \right) + \frac{(-1)^{k+1} (k+1)! \, (G_0(a) - 1)}{a^{k+2}} + \frac{(-1)^{k} k! \, G_1(a)}{a^{k+1}} \right) \\
&= -\left( \sum_{j=1}^{k} \frac{k! \, (-1)^{(k+1)-j} G_j(a)}{a^{(k+1)-j+1} (j-1)!} \left( \frac{k - j + 1}{j} + 1 \right) + \frac{G_{k+1}(a)}{a} + \frac{(-1)^{k+1} (k+1)! \, (G_0(a) - 1)}{a^{k+2}} \right) \\
&= -\left( \sum_{j=1}^{k+1} \frac{(k+1)! \, (-1)^{(k+1)-j} G_j(a)}{a^{(k+1)-j+1} j!} + \frac{(-1)^{k+1} (k+1)! \, (G_0(a) - 1)}{a^{k+2}} \right)
\end{aligned}$$

as claimed. ■
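
The closed form for $T^{(k)}$ can be compared against numerical differentiation. The sketch below assumes the definitions $G_k(a) = \sum_i x_i^{1+a} \log^k(x_i)$ and $T(a) = (1 - G_0(a))/a$, and takes the closed form as the head term $(-1)^k k! (1 - G_0(a))/a^{k+1}$ minus the sum over $j = 1, \ldots, k$; that sign convention is an assumption chosen so that the formula agrees with direct differentiation of $T$. Function names and the central-difference step are illustrative.

```python
import math

def G(x, a, k):
    """G_k(a) = sum_i x_i^(1+a) * log(x_i)^k."""
    return sum(p ** (1.0 + a) * math.log(p) ** k for p in x if p > 0)

def T(x, a):
    """Tsallis entropy as a function of a = alpha - 1 (a != 0)."""
    return (1.0 - G(x, a, 0)) / a

def T_derivative(x, a, k):
    """Closed form for the k-th derivative of T at a (sign convention assumed as in the lead-in)."""
    head = (-1) ** k * math.factorial(k) * (1.0 - G(x, a, 0)) / a ** (k + 1)
    tail = sum(
        (-1) ** (k - j) * math.factorial(k) * G(x, a, j)
        / (a ** (k - j + 1) * math.factorial(j))
        for j in range(1, k + 1)
    )
    return head - tail

def central_difference(f, a, h=1e-5):
    """Second-order numerical derivative of f at a."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

x = [0.5, 0.25, 0.125, 0.125]
a = 0.3
print(T_derivative(x, a, 1), central_difference(lambda t: T(x, t), a))
print(T_derivative(x, a, 2), central_difference(lambda t: T_derivative(x, t, 1), a))
```

Each printed pair should agree to several digits; the second pair checks the step from $k = 1$ to $k = 2$, mirroring the induction.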

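Claim A.11. Define $S_k(a) = a^{k+1} T^{(k)}(a)$. Then, for $1 \le j \le k + 1$,

$$S_k^{(j)}(a) = \sum_{i=0}^{j-1} \binom{j-1}{i} \frac{k!}{(k - j + i + 1)!} \, a^{k-j+i+1} \, G_{k+1+i}(a).$$

In particular, for $1 \le j \le k$, we have

$$\lim_{a \to 0} S_k^{(j)}(a) = 0 \quad\text{and}\quad \lim_{a \to 0} S_k^{(k+1)}(a) = k! \, G_{k+1}(0), \quad\text{so that}\quad \lim_{a \to 0} T^{(k)}(a) = \frac{G_{k+1}(0)}{k+1}.$$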

Proof. We prove the claim by induction on j. First, note 



7 
J =1 



so that 



6, (a) - (-1) fc.G l( a)-^ ((j + 1) - 1)! + (J 3 !)! 

= a fc G fc+ i(a) 
Thus, the base case holds. For the inductive step with 2 < j < k + 1, we have 

i=0 \ 



* J{k-j + i + l)\ 



+ C; 2 ) (t - J+ (- l +1)+ i/ - wwWG ^^)W ) 
The final equality holds since $\binom{j-2}{0} = \binom{j-1}{0} = 1$, $\binom{j-2}{j-2} = \binom{j-1}{j-1} = 1$, and by Pascal's formula $\binom{j-2}{i} + \binom{j-2}{i+1} = \binom{j-1}{i+1}$ for $0 \le i \le j - 3$.

For $1 \le j \le k$, every term in the above sum is well-defined for $a = 0$ and contains a power of $a$ which is at least 1, so $\lim_{a \to 0} S_k^{(j)}(a) = 0$. When $j = k + 1$, all terms but the first term contain a power of $a$ which is at least 1, and the first term is $k! \, G_{k+1}(a)$, so $\lim_{a \to 0} S_k^{(k+1)}(a) = k! \, G_{k+1}(0)$. The claim on $\lim_{a \to 0} T^{(k)}(a)$ thus follows by writing $T^{(k)}(a) = S_k(a)/a^{k+1}$ and then applying l'Hôpital's rule $k + 1$ times. ■



Proof (of Lemma 3.5). We will first show that

$$\left| T^{(k)}\left( -\frac{\varepsilon}{(k+1) \log m} \right) - \frac{G_{k+1}(0)}{k+1} \right| \le \frac{6 \varepsilon \log^k(m) H(x)}{k+1}.$$

Let $S_k(a) = a^{k+1} T^{(k)}(a)$ and note $T^{(k)}(a) = S_k(a)/a^{k+1}$. By Claim A.10, $\lim_{a \to 0} S_k(a) = 0$. Furthermore, $\lim_{a \to 0} S_k^{(j)}(a) = 0$ for all $1 \le j \le k$ by Claim A.11. Thus, when analyzing $\lim_{a \to 0} S_k^{(j)}(a)/(a^{k+1})^{(j)}$ for $0 \le j \le k$, both the numerator and denominator approach 0 and we can apply l'Hôpital's rule (here $(a^{k+1})^{(j)}$ denotes the $j$th derivative of the function $a^{k+1}$). By $k + 1$ applications of l'Hôpital's rule, we can thus say that $T^{(k)}(a)$ converges to its limit at least as quickly as $S_k^{(k+1)}(a)/(a^{k+1})^{(k+1)} = S_k^{(k+1)}(a)/(k+1)!$ does (using Claim A.7). We note that $G_j(a)$ is nonnegative for $j$ even and nonpositive otherwise. Thus, for negative $a$, each term in the summand of the expression for $S_k^{(k+1)}(a)$ in Claim A.11 is nonnegative for odd $k$ and nonpositive for even $k$. As the analyses for even and odd $k$ are nearly identical, we focus below on odd $k$, in which case every term in the summand is nonnegative. For odd $k$, $S_k^{(k+2)}(a)$ is nonpositive so that $S_k^{(k+1)}(a)$ is monotonically decreasing. Thus, it suffices to show that $S_k^{(k+1)}(-\varepsilon/((k+1) \log m))/(k+1)!$ is not much larger than its limit.

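$$\begin{aligned}
\frac{S_k^{(k+1)}\left( -\frac{\varepsilon}{(k+1) \log m} \right)}{(k+1)!} &= \frac{\sum_{i=0}^{k} \binom{k}{i} \frac{k!}{i!} \left( -\frac{\varepsilon}{(k+1) \log m} \right)^i G_{k+1+i}\left( -\frac{\varepsilon}{(k+1) \log m} \right)}{(k+1)!} \\
&\le \frac{1 + 2\varepsilon}{k+1} \sum_{i=0}^{k} \binom{k}{i} \left( \frac{\varepsilon}{(k+1) \log m} \right)^i |G_{k+1+i}(0)| \\
&\le \frac{1 + 2\varepsilon}{k+1} \sum_{i=0}^{k} k^i \left( \frac{\varepsilon}{(k+1) \log m} \right)^i |G_{k+1+i}(0)| \\
&\le \frac{1 + 2\varepsilon}{k+1} \sum_{i=0}^{k} \left( \frac{\varepsilon}{\log m} \right)^i |G_{k+1+i}(0)| \\
&\le \frac{1 + 2\varepsilon}{k+1} \sum_{i=0}^{k} \varepsilon^i |G_{k+1}(0)| \\
&= \frac{(1 + 2\varepsilon) |G_{k+1}(0)|}{k+1} + \frac{1 + 2\varepsilon}{k+1} \sum_{i=1}^{k} \varepsilon^i |G_{k+1}(0)| \\
&\le \frac{|G_{k+1}(0)|}{k+1} + \frac{6 \varepsilon \log^k(m) H(x)}{k+1}.
\end{aligned}$$

The first inequality holds since $x_i \ge 1/m$ for each $i$, so that $x_i^{-\varepsilon/((k+1) \log m)} \le m^{\varepsilon/((k+1) \log m)} \le m^{\varepsilon/\log m} \le e^{\varepsilon} \le 1 + 2\varepsilon$ for $\varepsilon \le 1/2$. The final inequality above holds since $\varepsilon \le 1/2$.

The lemma follows since $|G_{k+1}(0)| \le \log^k(m) H(x)$. ■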

Proof (of Lemma 3.6). Let $P_j$ denote the $j$th Chebyshev polynomial. We will prove for all $j \ge 1$ that

P i _ 1 (x)<P y (x)<P i _ 1 (x)(l + ^). 

For the first inequality we observe $P_{j-1} \in \mathcal{C}_j$, so we apply Fact 3.3 together with the fact that $P_j(y)$ is strictly positive for $y > 1$ for all $j$.

For the second inequality, we induct on $j$. For the sake of the proof define $P_{-1}(x) = 1$ so that the inductive hypothesis holds at the base case $j = 0$. For the inductive step with $j \ge 1$, we use the recurrence definition of $P_j(x)$ and we have



P i+ i(x) = PjWfl + ^j+iPjW-Pj. 



(x)) 



^ ^li+u+U) 



l) + ^wl 



2(J + 1)~ 



P,(x) 1 + 



A-' 



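A.3 Proofs from Section 4

Fact A.12. For any real $z > 0$, $\Gamma(z + 1) = z \Gamma(z)$.

Fact A.13. For any real $z \ge 0$, $\sin(z) \le z$.

Fact A.14 (Euler's Reflection Formula). For any real $z$, $\Gamma(z) \Gamma(1 - z) = \pi/\sin(\pi z)$.

Definition A.15. The function $V : \mathbb{R}^+ \to \mathbb{R}$ is defined by

$$V(\alpha) = \frac{\left[ \frac{2}{\pi} \Gamma\left(\frac{2\alpha}{3}\right) \Gamma\left(\frac{1}{3}\right) \sin\left(\frac{\pi\alpha}{3}\right) \right]^3}{\left[ \frac{2}{\pi} \Gamma\left(\frac{\alpha}{3}\right) \Gamma\left(\frac{2}{3}\right) \sin\left(\frac{\pi\alpha}{6}\right) \right]^6} - 1.$$

Lemma A.16.

$$\lim_{\alpha \to 0} V(\alpha) = \frac{\Gamma\left(\frac{1}{3}\right)^3}{\Gamma\left(\frac{2}{3}\right)^6} - 1.$$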

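Proof. Define $u(\alpha) = \Gamma(2\alpha/3)(\pi\alpha/3) = \Gamma(2\alpha/3)(2\alpha/3)(\pi/2) = \Gamma((2\alpha/3) + 1)(\pi/2)$ by Fact A.12. By the continuity of $\Gamma(\cdot)$ on $\mathbb{R}^+$, $\lim_{\alpha \to 0} u(\alpha) = \Gamma(1)\pi/2 = \pi/2$. Define $f(\alpha) = \Gamma(2\alpha/3) \sin(\pi\alpha/3)$. Then $f(\alpha) \le u(\alpha)$ for all $\alpha \ge 0$ by Fact A.13, and thus $\lim_{\alpha \to 0} f(\alpha) \le \pi/2$. Now define $\ell_\delta(\alpha) = \Gamma(2\alpha/3)(1 - \delta)(\pi\alpha/3)$. By the definition of the derivative and the fact that the derivative of $\sin(\alpha)$ evaluated at $\alpha = 0$ is 1, it follows that $\forall \delta > 0 \; \exists \varepsilon > 0$ s.t. $0 \le \alpha < \varepsilon \Rightarrow \sin(\alpha) \ge (1 - \delta)\alpha$. Thus, $\forall \delta > 0 \; \exists \varepsilon > 0$ s.t. $0 \le \alpha < \varepsilon \Rightarrow \ell_\delta(\alpha) \le f(\alpha)$, and so $\forall \delta > 0$ we have that $\lim_{\alpha \to 0} f(\alpha) \ge \lim_{\alpha \to 0} \ell_\delta(\alpha) = (1 - \delta)\pi/2$. Thus, $\lim_{\alpha \to 0} f(\alpha) \ge \pi/2$, implying $\lim_{\alpha \to 0} f(\alpha) = \pi/2$. Similarly we can define $g(\alpha) = \Gamma(\alpha/3) \sin(\pi\alpha/6)$ and show $\lim_{\alpha \to 0} g(\alpha) = \pi/2$.
Now,

$$V(\alpha) = \frac{\left[ \frac{2}{\pi} \Gamma\left(\frac{1}{3}\right) f(\alpha) \right]^3}{\left[ \frac{2}{\pi} \Gamma\left(\frac{2}{3}\right) g(\alpha) \right]^6} - 1.$$

Thus $\lim_{\alpha \to 0} V(\alpha) = \Gamma(1/3)^3/\Gamma(2/3)^6 - 1$ as claimed. ■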

Proof (of Lemma 4.2). Li shows in [21] that the variance of the geometric mean estimator with $k = 3$ is $V(\alpha) F_\alpha^2$. As $\Gamma(z)$ and $\sin(z)$ are continuous for $z \in \mathbb{R}^+$, so is $V(\alpha)$. Furthermore, Lemma A.16 shows that $\lim_{\alpha \to 0} V(\alpha)$ exists (and equals $\Gamma(1/3)^3/\Gamma(2/3)^6 - 1$). We define $V(0)$ to be this limit. Thus $V(\alpha)$ is continuous on $[0, 2]$, and the extreme value theorem implies there exists a constant $C_{GM}$ such that $V(\alpha) \le C_{GM}$ on $[0, 2]$. ■
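
The boundedness of $V$ on $(0, 2]$ and its limit at 0 can be inspected numerically with the standard-library gamma function. This is a minimal sketch that assumes the form of $V$ given in Definition A.15 for $k = 3$; the grid of $\alpha$ values is an arbitrary choice for illustration and does not by itself establish the constant $C_{GM}$.

```python
import math

def V(alpha):
    """Variance factor of the k = 3 geometric mean estimator (form assumed from Definition A.15)."""
    num = (2.0 / math.pi) * math.gamma(2.0 * alpha / 3.0) * math.gamma(1.0 / 3.0) \
        * math.sin(math.pi * alpha / 3.0)
    den = (2.0 / math.pi) * math.gamma(alpha / 3.0) * math.gamma(2.0 / 3.0) \
        * math.sin(math.pi * alpha / 6.0)
    return num ** 3 / den ** 6 - 1.0

# Limit claimed by Lemma A.16.
V_LIMIT = math.gamma(1.0 / 3.0) ** 3 / math.gamma(2.0 / 3.0) ** 6 - 1.0

grid = [0.01 * i for i in range(1, 201)]   # alpha in (0, 2]
print("V near 0:", V(1e-6), "limit:", V_LIMIT)
print("max of V on the grid:", max(V(a) for a in grid))
```

At $\alpha = 1$ the reflection formula collapses the numerator to 2 and the denominator to $2/\sqrt{3}$, so $V(1) = 27/8 - 1$, which gives a convenient exact check of the implementation.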

A.4 Detailed Analysis of Geometric Mean Residual Moments Algorithm

Formally, define $R = \log_{1+\varepsilon} m$, and let $I_z = \{ i : (1 + \varepsilon)^z \le |A_i| < (1 + \varepsilon)^{z+1} \}$ for $0 \le z \le R$. Let $z^*$ satisfy $(1 + \varepsilon)^{z^*} \le |A_1| < (1 + \varepsilon)^{z^*+1}$. For $1 \le j \le r$ and $0 \le z \le R$, define $X_{j,z} = \sum_{i \in I_z} \mathbf{1}_{h_j(i) \ne h_j(1)}$. We now analyze the $j$th trial.



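Claim A.17. $E\left[ 2 \cdot F_{\alpha,j,1-h_j(1)} \right] = (1 \pm O(\varepsilon)) \cdot F_\alpha^{res}$.

Proof. We have

$$\begin{aligned}
E\left[ 2 \cdot F_{\alpha,j,1-h_j(1)} \right] &= 2 \cdot E\left[ \sum_i |A_i|^\alpha \cdot \mathbf{1}_{h_j(i) \ne h_j(1)} \right] \\
&= 2 \cdot E\left[ \sum_z \sum_{i \in I_z} |A_i|^\alpha \cdot \mathbf{1}_{h_j(i) \ne h_j(1)} \right] \\
&= 2 \cdot E\left[ \sum_z \sum_{i \in I_z} \left( (1 \pm \varepsilon)(1 + \varepsilon)^z \right)^\alpha \cdot \mathbf{1}_{h_j(i) \ne h_j(1)} \right] \\
&= (1 \pm \varepsilon)^\alpha \cdot \sum_z (1 + \varepsilon)^{z\alpha} \, E\left[ 2 X_{j,z} \right].
\end{aligned}$$

Clearly $E[2 \cdot X_{j,z}]$ is $|I_z| - 1$ if $z = z^*$ and $|I_z|$ otherwise. Thus

$$\sum_z (1 + \varepsilon)^{z\alpha} \, E\left[ 2 \cdot X_{j,z} \right] = \sum_{i \ge 2} \left( (1 \pm \varepsilon) |A_i| \right)^\alpha = (1 \pm \varepsilon)^\alpha \cdot F_\alpha^{res}.$$

Since $\alpha \le 2$, $(1 \pm \varepsilon)^\alpha = 1 \pm O(\varepsilon)$, so this shows the desired result. ■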

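We now show concentration for $X_z := \sum_{1 \le j \le r} X_{j,z}$. By independence of the $h_j$'s, Chernoff bounds show that $X_z = (1 \pm \varepsilon) E[X_z]$ with probability at least $1 - \exp(-\Theta(\varepsilon^2 r))$. This quantity is at least $1 - \frac{1}{8(R+1)}$ if we choose $r = c_1 \lceil \varepsilon^{-2} (\log\log \|A\|_1 + \log(c_3/\varepsilon)) \rceil$. The good event is the event that, for all $z$, $X_z = (1 \pm \varepsilon) E[X_z]$; a union bound shows that this occurs with probability at least 7/8. So suppose that the good event occurs. Then a calculation analogous to Claim A.17 shows that

$$\begin{aligned}
\frac{1}{r} \sum_{j=1}^{r} 2 \cdot F_{\alpha,j,1-h_j(1)} &= (1 \pm \varepsilon)^\alpha \cdot \sum_z (1 + \varepsilon)^{z\alpha} \cdot \frac{2 X_z}{r} \\
&= (1 \pm \varepsilon)^\alpha \cdot \sum_z (1 + \varepsilon)^{z\alpha} \cdot (1 \pm \varepsilon) E\left[ 2 X_{j,z} \right] \\
&= (1 \pm O(\varepsilon)) \cdot F_\alpha^{res}. \qquad \text{(A.4)}
\end{aligned}$$

Recall that $\tilde F_\alpha^{res} = \sum_{j=1}^{r} \frac{2}{r} \cdot \tilde F_{\alpha,j,1-h_j(1)}$. Since the geometric mean estimator is unbiased, we also have that

$$E\left[ \tilde F_\alpha^{res} \right] = \frac{1}{r} \sum_{j=1}^{r} 2 \cdot F_{\alpha,j,1-h_j(1)}. \qquad \text{(A.5)}$$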



We conclude the analysis by showing that the random variable $\tilde F_\alpha^{res}$ is concentrated. By Lemma 4.2 applied to each substream, and properties of variance, we have



Var 



F 1 



4 v^ r ~ 

Z2Z^ Var F «,i,i-^(i) 



< I^m.e 



F a ,j,i-h 3 {i) 



< 



Com 



■E 



F 1 



i=i 



Chebyshev's inequality therefore shows that 



Pr 



F r a cs = (1 ± e) E 



F r 



Var 



> 1 



(e-E 



F 1 ; 



F 1 



_ ^M > g /7) 



e 2 r 



by appropriate choice of constants. This event and the good event both occur with probability 
at least 3/4. When this holds, we have 



/Tires 



(l±e)E 



(l±e)E 



Z_> r ^a,j,i-h. 



(i) 



;i ± 0(e)) ■ F- 



by Eq. (A.5) and Eq. (A.4). 

A.5 Proofs from Section 4.2 

Proof (of Fact 4.1). Let $B = \lceil 20/\varepsilon \rceil$ be the number of bins. Let $\mathcal{H}$ be a pairwise independent family of hash functions, each function mapping $[n]$ to $[B]$. Standard constructions yield such a family with $|\mathcal{H}| = n^{O(1)}$. We will let $h$ be a randomly chosen hash function from $\mathcal{H}$.

For notational simplicity, suppose that $x_1 = \max_i x_i$. Let $\mathcal{E}_{i,j}$ be the indicator variable for the event that $h(i) = j$, so that $E[\mathcal{E}_{i,j}] = 1/B$ and $\mathrm{Var}[\mathcal{E}_{i,j}] < 1/B$. Let $X_j$ be the random variable denoting the weight of the items that hash to bin $j$, i.e., $X_j = \sum_i x_i \cdot \mathcal{E}_{i,j}$. Since $\sum_i x_i = 1$, we have $E[X_j] = 1/B$ and $\mathrm{Var}[X_j] < \|x\|_2^2/B$.

Suppose that $x_1 \ge 1/2$. Let $Y$ be the fraction of mass that hashes to $x_1$'s bin, excluding $x_1$ itself. That is, $Y = \sum_{i \ge 2} x_i \cdot \mathcal{E}_{i,h(1)}$. Note that $E[Y] = (\sum_{i \ge 2} x_i)/B \le (\varepsilon/20) \cdot (\sum_{i \ge 2} x_i)$. By Markov's inequality,

$$\Pr\left[ Y \ge \frac{4\varepsilon}{5} \cdot \left( \sum_{i \ge 2} x_i \right) \right] \le \Pr\left[ Y \ge 16 \, E[Y] \right] \le 1/16.$$

Suppose that $x_1 < 1/2$. This implies, by convexity, that $\|x\|_2^2 < 1/2$. Let $\beta = \sqrt{2/3} < 5/6$. Then

$$\Pr\left[ |X_j - 1/B| \ge \beta \right] \le \frac{\mathrm{Var}[X_j]}{\beta^2} \le \frac{3}{4B}.$$

Thus, by a union bound,

$$\Pr\left[ \exists j \text{ such that } X_j \ge \beta + 1/B \right] \le \frac{3}{4}.$$

Suppose we want to test if $x_1 \ge 1/2$ by checking if there is a bin of mass at least 5/6. As argued above, the failure probability of one hash function is at most 3/4. If we choose ten independent hash functions and check that all of them have a bin of mass at least 5/6, then the failure probability decreases to less than 1/16. ■
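
The binning test in this proof can be simulated offline. The sketch below is an illustration under stated assumptions rather than the streaming algorithm itself: it works on an explicit probability vector, and it uses the hash family $i \mapsto ((a i + b) \bmod p) \bmod B$ as a stand-in for a pairwise independent family (this family is only approximately pairwise independent over $[B]$). The names `make_hash`, `has_heavy_bin`, and `looks_heavy` are hypothetical.

```python
import math
import random

PRIME = 2_147_483_647  # the Mersenne prime 2^31 - 1

def make_hash(num_bins, rng):
    """Draw one hash function i -> ((a*i + b) mod PRIME) mod num_bins."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda i: ((a * i + b) % PRIME) % num_bins

def has_heavy_bin(x, h, num_bins, threshold=5.0 / 6.0):
    """Hash every item into a bin and report whether some bin holds mass >= threshold."""
    bins = [0.0] * num_bins
    for i, p in enumerate(x):
        bins[h(i)] += p
    return max(bins) >= threshold

def looks_heavy(x, eps, trials=10, seed=0):
    """Accept only if all `trials` independent hash functions produce a heavy bin."""
    rng = random.Random(seed)
    num_bins = math.ceil(20.0 / eps)
    return all(has_heavy_bin(x, make_hash(num_bins, rng), num_bins) for _ in range(trials))

heavy = [0.9] + [0.1 / 99] * 99     # one dominant element
uniform = [1.0 / 100] * 100         # no dominant element
print(looks_heavy(heavy, eps=0.5), looks_heavy(uniform, eps=0.5))
```

Requiring all ten trials to agree is what drives the false-positive probability from at most 3/4 per trial down to $(3/4)^{10} < 1/16$.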

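A.6 Proofs from Section 5

Proof (of Theorem 5.1). Let $m_i$ be the number of times the $i$-th element appears in the stream. Recall that $m$ is the length of the stream. By computing a $(1 + \varepsilon')$-approximation to the $\alpha$th moment (as in Fact 2.1) and dividing by $\|A\|_1^\alpha$, we get a multiplicative approximation to $F_\alpha/\|A\|_1^\alpha = \|x\|_\alpha^\alpha$. We can thus compute the value

$$\frac{1}{1 - \alpha} \log\left( (1 \pm \varepsilon') \|x\|_\alpha^\alpha \right) = H_\alpha(x) \pm O\left( \frac{\varepsilon'}{|1 - \alpha|} \right).$$

Setting $\varepsilon' = \varepsilon \cdot |1 - \alpha|$, we obtain an additive approximation algorithm using

$$O\left( \left( \frac{|1 - \alpha|}{\varepsilon^2 \cdot |\alpha - 1|^2} + \frac{1}{\varepsilon \cdot |\alpha - 1|} \right) \log m \right) = O\left( \log m/(|1 - \alpha| \cdot \varepsilon^2) \right)$$

bits, as claimed.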

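Proof (of Theorem 5.2). If $\alpha \in (0, 1)$, then because the function $x^\alpha$ is concave, we get by Jensen's inequality

$$\sum_{i=1}^n x_i^\alpha \le n \cdot \left( \frac{1}{n} \right)^\alpha = n^{1-\alpha}.$$

If we compute a multiplicative $(1 + (1 - \alpha) \cdot \varepsilon \cdot n^{\alpha-1})$-approximation to the $\alpha$th moment, we obtain an additive $(1 - \alpha) \cdot \varepsilon$-approximation to $(\sum_{i=1}^n x_i^\alpha) - 1$. This in turn gives an additive $\varepsilon$-approximation to $T_\alpha$. By Fact 2.1,

$$O\left( \left( \frac{1 - \alpha}{(1 - \alpha)^2 \varepsilon^2 n^{2\alpha - 2}} + \frac{1}{(1 - \alpha) \varepsilon n^{\alpha - 1}} \right) \log m \right) = O\left( \frac{n^{2(1-\alpha)} \log m}{(1 - \alpha) \varepsilon^2} \right)$$

bits of space suffice to achieve the required approximation to the $\alpha$th moment.

For $\alpha > 1$, the value $F_\alpha/\|A\|_1^\alpha$ is at most 1, so it suffices to approximate $F_\alpha$ to within a factor of $1 + (\alpha - 1) \cdot \varepsilon$. For $\alpha \in (1, 2]$, again using Fact 2.1, we can achieve this using $O(\log m/((\alpha - 1)\varepsilon^2))$ bits of space. ■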
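
The conversion from a multiplicative error on the $\alpha$th moment to an additive error on Tsallis entropy, for $\alpha \in (0, 1)$, can be traced on a concrete vector. The sketch below assumes the relative accuracy $\delta = (1 - \alpha) \cdot \varepsilon \cdot n^{\alpha - 1}$ and perturbs the exact power sum by hand; it does not implement the moment estimator of Fact 2.1, and all names are illustrative.

```python
def power_sum(x, alpha):
    """sum_i x_i^alpha over the positive entries of x."""
    return sum(p ** alpha for p in x if p > 0)

def tsallis_from_power_sum(s, alpha):
    """T_alpha = (1 - s) / (alpha - 1) where s = sum_i x_i^alpha."""
    return (1.0 - s) / (alpha - 1.0)

def required_relative_error(n, alpha, eps):
    """delta = (1 - alpha) * eps * n^(alpha - 1), for alpha in (0, 1)."""
    return (1.0 - alpha) * eps * n ** (alpha - 1.0)

def worst_additive_error(x, alpha, eps):
    """Largest change in T_alpha when the power sum is off by a factor (1 +/- delta)."""
    s = power_sum(x, alpha)
    delta = required_relative_error(len(x), alpha, eps)
    exact = tsallis_from_power_sum(s, alpha)
    return max(
        abs(tsallis_from_power_sum(s * (1.0 + sign * delta), alpha) - exact)
        for sign in (-1.0, 1.0)
    )

x = [0.5, 0.25, 0.125, 0.0625, 0.0625]
for alpha in (0.2, 0.5, 0.9):
    print(alpha, power_sum(x, alpha), len(x) ** (1.0 - alpha), worst_additive_error(x, alpha, 0.1))
```

The middle two columns illustrate the Jensen bound $\sum_i x_i^\alpha \le n^{1-\alpha}$, which is what keeps the additive error below $\varepsilon$.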

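Proof (of Lemma 5.3). Consider first $\alpha \in (0, 1)$. For $x \in (0, 5/6]$,

$$\frac{x^\alpha}{x} = x^{\alpha - 1} \ge \left( \frac{5}{6} \right)^{\alpha - 1} \ge 1 + C_1 \cdot (1 - \alpha),$$

for some positive constant $C_1$. The last inequality follows from convexity of $(5/6)^y$ as a function of $y$. Hence,

$$\sum_{i=1}^n x_i^\alpha \ge \sum_{i=1}^n (1 + C_1(1 - \alpha)) x_i = 1 + C_1(1 - \alpha),$$

and furthermore,

$$\left| 1 - \sum_{i=1}^n x_i^\alpha \right| = \left( \sum_{i=1}^n x_i^\alpha \right) - 1 \ge C_1 \cdot (1 - \alpha) = C_1 \cdot |\alpha - 1|.$$

When $\alpha \in (1, 2]$, then for $x \in (0, 5/6]$,

$$\frac{x^\alpha}{x} \le \left( \frac{5}{6} \right)^{\alpha - 1} \le 1 - C_2 \cdot (\alpha - 1),$$

for some positive constant $C_2$. This implies that

$$\sum_{i=1}^n x_i^\alpha \le \sum_{i=1}^n x_i (1 - C_2 \cdot (\alpha - 1)) = 1 - C_2 \cdot (\alpha - 1)$$

and

$$\left| 1 - \sum_{i=1}^n x_i^\alpha \right| = 1 - \sum_{i=1}^n x_i^\alpha \ge C_2 \cdot (\alpha - 1) = C_2 \cdot |\alpha - 1|.$$

To finish the proof of the lemma, we set $C = \min\{C_1, C_2\}$. ■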
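
For concreteness, one admissible choice of constants can be checked numerically. The values $C_1 = \ln(6/5)$ and $C_2 = 1/6$ below are assumptions made for this sketch: the first comes from the tangent line of $(5/6)^y$ at $y = 0$ and the second from the chord of $(5/6)^y$ over $[0, 1]$. The lemma itself only asserts that some positive constants exist.

```python
import math

C1 = math.log(6.0 / 5.0)   # assumed constant for alpha in (0, 1)
C2 = 1.0 / 6.0             # assumed constant for alpha in (1, 2]
C = min(C1, C2)

def gap_from_one(x, alpha):
    """|1 - sum_i x_i^alpha| for a probability vector x."""
    return abs(1.0 - sum(p ** alpha for p in x if p > 0))

def satisfies_lemma(x, alpha):
    """Check |1 - sum_i x_i^alpha| >= C * |alpha - 1| when no entry exceeds 5/6."""
    if max(x) > 5.0 / 6.0 + 1e-12:
        raise ValueError("the lemma assumes every probability is at most 5/6")
    return gap_from_one(x, alpha) >= C * abs(alpha - 1.0)

examples = [[5.0 / 6.0, 1.0 / 6.0], [0.5, 0.5], [0.1] * 10]
for x in examples:
    print([satisfies_lemma(x, alpha) for alpha in (0.1, 0.5, 0.9, 1.1, 1.5, 2.0)])
```

The two-point distribution with mass 5/6 is the extreme case allowed by the hypothesis, so it is the most informative of the three examples.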

Proof (of Lemma 5.5). We first argue that a multiplicative approximation to $|1 - x_i^\alpha|$ can be obtained from a multiplicative approximation to $1 - x_i$. Let $g(y) = 1 - (1 - y)^\alpha$. Note that $g(1 - x_i) = 1 - x_i^\alpha$. Since $1 - x_i \in [0, 1/3]$, we restrict the domain of $g$ to $[0, 1/3]$. The derivative of $g$ is $g'(y) = \alpha(1 - y)^{\alpha - 1}$. Note that $g$ is strictly increasing for $\alpha \in (0, 1) \cup (1, 2]$. For $\alpha \in (0, 1)$, the derivative is in the range $[\alpha, \frac{3}{2}\alpha]$. For $\alpha \in (1, 2]$, it always lies in the range $[\frac{2}{3}\alpha, \alpha]$. In both cases, a $(1 + \frac{2}{3}\varepsilon)$-approximation to $y$ suffices to compute a $(1 + \varepsilon)$-approximation to $g(y)$.
We now consider two cases:

• Assume first that $\alpha \in (0, 1)$. For any $x \in (0, 1/3]$, we have

$$\frac{x^\alpha}{x} \ge \left( \frac{1}{3} \right)^{\alpha - 1} = 3^{1 - \alpha} \ge 1 + C_1(1 - \alpha),$$

for some positive constant $C_1$. The last inequality follows from the convexity of the function $3^y$. This means that if $x_i < 1$, then

$$\frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i} \ge \frac{\sum_{j \ne i} x_j (1 + C_1(1 - \alpha))}{1 - x_i} = \frac{(1 - x_i)(1 + C_1(1 - \alpha))}{1 - x_i} = 1 + C_1(1 - \alpha).$$

Since $x_i \le x_i^\alpha < 1$, we also have

$$\frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i^\alpha} \ge \frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i} \ge 1 + C_1(1 - \alpha).$$

This implies that if we compute multiplicative $(1 + (1 - \alpha)\varepsilon/D_1)$-approximations to both $1 - x_i^\alpha$ and $\sum_{j \ne i} x_j^\alpha$, for a sufficiently large constant $D_1$, we compute a multiplicative $(1 + \varepsilon)$-approximation of $(\sum_{j=1}^n x_j^\alpha) - 1$.

• The case of $\alpha \in (1, 2]$ is similar. For any $x \in (0, 1/3]$, we have

$$\frac{x^\alpha}{x} \le \left( \frac{1}{3} \right)^{\alpha - 1} \le 1 - C_2(\alpha - 1),$$

for some positive constant $C_2$. Hence,

$$\frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i} \le \frac{\sum_{j \ne i} x_j (1 - C_2(\alpha - 1))}{1 - x_i} = \frac{(1 - x_i)(1 - C_2(\alpha - 1))}{1 - x_i} = 1 - C_2(\alpha - 1),$$

and because $x_i^\alpha < x_i$,

$$\frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i^\alpha} \le \frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i} \le 1 - C_2(\alpha - 1).$$

This implies that if we compute multiplicative $(1 + (\alpha - 1)\varepsilon/D_2)$-approximations to both $1 - x_i^\alpha$ and $\sum_{j \ne i} x_j^\alpha$, for a sufficiently large constant $D_2$, we can compute a multiplicative $(1 + \varepsilon)$-approximation to $1 - \sum_{j=1}^n x_j^\alpha$.



Proof (of Theorem 5.6). We run the algorithm of Section 4.1 to find out if there is a very 
heavy element. This only requires O(logra) words of space. 

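If there is no heavy element, then by Lemma 5.3 there is a constant $C \in (0, 1)$ such that $|1 - \sum_i x_i^\alpha| \ge C|\alpha - 1|$. We want to compute a multiplicative approximation to $|1 - \sum_i x_i^\alpha|$. We know that the difference between $\sum_i x_i^\alpha$ and 1 is large. Therefore, if we compute a multiplicative $(1 + \frac{1}{2}|\alpha - 1| C \varepsilon)$-approximation to $\sum_i x_i^\alpha$, we obtain an additive $(\frac{1}{2}|\alpha - 1| C \varepsilon \sum_i x_i^\alpha)$-approximation to $\sum_i x_i^\alpha$. If $\sum_i x_i^\alpha \le 2$, then

$$\frac{\frac{1}{2}|\alpha - 1| C \varepsilon \sum_i x_i^\alpha}{|1 - \sum_i x_i^\alpha|} \le \frac{|\alpha - 1| C \varepsilon}{C |\alpha - 1|} = \varepsilon.$$

If $\sum_i x_i^\alpha > 2$, then

$$\frac{\frac{1}{2}|\alpha - 1| C \varepsilon \sum_i x_i^\alpha}{|1 - \sum_i x_i^\alpha|} \le \frac{1}{2}|\alpha - 1| C \varepsilon \cdot 2 \le \varepsilon.$$

In either case, we obtain a multiplicative $(1 + \varepsilon)$-approximation to $|1 - \sum_i x_i^\alpha|$, which in turn yields a multiplicative approximation to the Tsallis entropy. We now need to bound the amount of space we use in this case. We use the estimator of Fact 2.1, which uses $O(\log m/(|\alpha - 1|\varepsilon^2))$ bits in our case.

Let us focus now on the case when there is a heavy element. By Lemma 5.5 it suffices to approximate $F_1^{res}$ and $F_\alpha^{res}$, which we can do using the algorithm of Section 4.2. The number of bits required is

$$O\left( \frac{\log m}{\varepsilon \cdot |\alpha - 1|} \right) + O\left( \frac{|\alpha - 1| \cdot \log m}{(\varepsilon \cdot |\alpha - 1|)^2} \right) = O\left( \frac{\log m}{\varepsilon^2 \cdot |\alpha - 1|} \right).$$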



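Proof (of Lemma 5.7). For $t \in [4/9, 1]$, the derivative of the logarithm function lies in the range $[a, b]$, where $a$ and $b$ are constants such that $0 < a < b$. This implies that in this case, a $(1 + \varepsilon)$-approximation to $t - 1$ gives a $(1 + \frac{b}{a}\varepsilon)$-approximation to $\log(t)$. We are given $y \in [1 - t, (1 + \varepsilon)(1 - t)]$, and we can assume that $y \in [1 - t, \min\{5/9, (1 + \varepsilon)(1 - t)\}]$. We have

$$-\log(t) \le -\log(1 - y),$$

and

$$\begin{aligned}
\frac{\log(1 - y)}{\log(t)} &\le \frac{-\log(1 - (1 + \varepsilon)(1 - t))}{-\log(t)} = \frac{-\log(t - \varepsilon(1 - t))}{-\log(t)} \\
&= \frac{-\log(t) + \left( -\log(t - \varepsilon(1 - t)) + \log(t) \right)}{-\log(t)} \\
&= 1 + \frac{-\log(t - \varepsilon(1 - t)) + \log(t)}{-\log(t)} \\
&\le 1 + \frac{\varepsilon(1 - t) \cdot \max_{z \in [\max\{t - \varepsilon(1 - t), 4/9\}, t]} (\log(z))'}{(1 - t) \cdot \min_{z \in [4/9, 1]} (\log(z))'} \\
&\le 1 + \frac{\varepsilon(1 - t) \cdot b}{(1 - t) \cdot a} = 1 + \frac{b}{a}\varepsilon.
\end{aligned}$$

Consider now $t > 1$. We are given $y \in [t - 1, (1 + \varepsilon)(t - 1)]$, and we have

$$\log(t) \le \log(y + 1) \le \log((1 + \varepsilon)(t - 1) + 1).$$

Furthermore,

$$\begin{aligned}
\frac{\log((1 + \varepsilon)(t - 1) + 1)}{\log(t)} &= \frac{\log(t) + \log((1 + \varepsilon)(t - 1) + 1) - \log(t)}{\log(t)} \\
&= 1 + \frac{\log(t + (t - 1)\varepsilon) - \log(t)}{\log(t)} \\
&= 1 + \frac{\int_t^{t + (t - 1)\varepsilon} (\log(z))' \, dz}{\int_1^t (\log(z))' \, dz} \\
&\le 1 + \frac{(t - 1)\varepsilon \cdot \max_{z \in [t, t + (t - 1)\varepsilon]} (\log(z))'}{(t - 1) \cdot \min_{z \in [1, t]} (\log(z))'} \\
&\le 1 + \frac{(t - 1)\varepsilon / t}{(t - 1)/t} = 1 + \varepsilon.
\end{aligned}$$

Hence, we get a good multiplicative approximation to $\log(t)$. ■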
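
The argument can be exercised numerically by feeding in the worst admissible overestimate of $|t - 1|$. The sketch below assumes natural logarithms, for which the derivative $1/z$ on $[4/9, 1]$ lies in $[1, 9/4]$, so it uses $b/a = 9/4$ as the constant; the clipping at 5/9 mirrors the assumption $y \le 5/9$ in the proof. Function names and the grid of $t$ values are illustrative.

```python
import math

def log_estimate(y, below_one):
    """Estimate log(t) from y, an overestimate of |t - 1| by a factor of at most (1 + eps)."""
    if below_one:
        y = min(y, 5.0 / 9.0)      # t >= 4/9 forces 1 - t <= 5/9
        return math.log(1.0 - y)   # negative, like log(t) for t < 1
    return math.log(1.0 + y)

def worst_ratio(eps, ts):
    """Largest value of estimate / log(t) over the grid, using y = (1 + eps) * |t - 1|."""
    worst = 1.0
    for t in ts:
        y = (1.0 + eps) * abs(t - 1.0)
        worst = max(worst, log_estimate(y, t < 1.0) / math.log(t))
    return worst

# Grid over (4/9, about 5.4), skipping any point numerically equal to 1.
ts = [4.0 / 9.0 + 0.01 * i for i in range(1, 500)]
ts = [t for t in ts if abs(t - 1.0) > 1e-9]
for eps in (0.01, 0.1, 0.5):
    print(eps, worst_ratio(eps, ts), 1.0 + 2.25 * eps)
```

On this grid the observed ratio stays below $1 + \frac{9}{4}\varepsilon$, with the largest values occurring for $t < 1$, where the logarithm is steepest.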

Proof (of Theorem 5.8). We use the algorithm of Section 4.1 to check if there is a single element of high frequency. This only requires $O(\log m)$ bits of space.

If there is no element of frequency greater than 5/6, then the Renyi entropy for any $\alpha$ is greater than the min-entropy $H_\infty = -\log \max_i x_i \ge \log(6/5)$. Therefore, in this case it suffices to run the additive approximation algorithm with $\varepsilon' = \log(6/5)\varepsilon$ to obtain a sufficiently good estimate. To run that algorithm, we use $O\left( \frac{\log m}{|1 - \alpha| \varepsilon^2} \right)$ bits of space.

Let us consider the other case, when there is an element of frequency at least 2/3. For $\alpha \in (1, 2]$, we have

$$\sum_{i=1}^n x_i^\alpha \ge \left( \frac{2}{3} \right)^\alpha \ge \frac{4}{9},$$

and for $\alpha \in (0, 1)$, $\sum_{i=1}^n x_i^\alpha \ge 1$. Therefore, by Lemma 5.7, it suffices to compute a multiplicative approximation to $|1 - \sum_i x_i^\alpha|$, which we can do by Lemma 5.5. By algorithms from Section 4.3 and Section 4.2, we can compute the multiplicative $(1 + O(|1 - \alpha|\varepsilon))$-approximations required by Lemma 5.5 with the same space complexity as for the approximation of Tsallis entropy (see the proof of Theorem 5.6). ■

Proof (of Theorem 5.9). The proof is nearly identical to that of Theorem 3.1 in [2]. We 
need merely observe that if H a is a (1 + ^-approximation to H a , then m a { l + £ )2( 1 ~ a ) Ha is a 
multiplicative m ae -approximation to F a . From here, we set t = cm 6 n 1 ' a and argue identically 
as in [2] via a reduction from t-party disjointness; we omit the details. ■ 